Interlayer Link Prediction in Multiplex Social Networks Based on Multiple Types of Consistency between Embedding Vectors
Rui Tang, Zhenxiong Miao, Shuyu Jiang, Xingshu Chen, Haizhou Wang, Wenxian Wang, Mingyong Yin, Wei Wang
1. College of Cybersecurity, Sichuan University, Chengdu 610065, China
2. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China
3. College of Computer Science, Sichuan University, Chengdu 610065, China
Abstract
Online users are typically active on multiple social media networks (SMNs), which constitute a multiplex social network. With improvements in cybersecurity awareness, users increasingly choose different usernames and provide different profiles on different SMNs. Thus, it is becoming increasingly challenging to determine whether given accounts on different SMNs belong to the same user; this can be expressed as an interlayer link prediction problem in a multiplex network. To address the challenge of predicting interlayer links, feature or structure information is leveraged. Existing methods that use network embedding techniques to address this problem focus on learning a mapping function to unify all nodes into a common latent representation space for prediction; positional relationships between unmatched nodes and their common matched neighbors (CMNs) are not utilized. Furthermore, the layers are often modeled as unweighted graphs, ignoring the strengths of the relationships between nodes. To address these limitations, we propose a framework based on multiple types of consistency between embedding vectors (MulCEV). In MulCEV, the traditional embedding-based method is applied to obtain the degree of consistency between the vectors representing the unmatched nodes, and a proposed distance consistency index based on the positions of nodes in each latent space provides additional clues for prediction. By associating these two types of consistency, the effective information in the latent spaces is fully utilized. Additionally, MulCEV models the layers as weighted graphs to obtain better representation. In this way, the higher the strength of the relationship between nodes, the more similar their embedding vectors in the latent representation space will be. The results of our experiments on several real-world datasets demonstrate that the proposed MulCEV framework markedly outperforms current embedding-based methods, especially when the number of training iterations is small.

Keywords: social media network, interlayer link prediction, network embedding, multiplex network

∗ Corresponding Author: [email protected], [email protected]

Preprint submitted to Elsevier, August 14, 2020
1. Introduction
Social media network (SMN) applications have significantly enriched the daily lives of users and have attracted the attention of many researchers [1, 2, 3]. Most modern people leverage them for connection with one another, content sharing, and entertainment. According to a report by the Pew Research Center, 72% of U.S. citizens used some type of SMN in 2019 [4]. The situation is similar in China, which has 896 million SMN users [5]. To satisfy their diverse needs and interests, online users often make use of several SMNs simultaneously, for example by recording individual impressions of current events on Twitter, sharing photographs on Instagram, and searching for job information on LinkedIn. These various SMNs thus form a multiplex social network [6, 7, 8, 9, 10], of which each SMN constitutes a layer. Accounts are represented as nodes, and interactions within an SMN are represented as intralayer links in the multiplex network. If two accounts in different SMNs belong to the same user, an interlayer link exists between the corresponding nodes across different layers. The structure of a multiplex network has a significant influence on cascades [11], propagation [12], synchronization [13], and games [14].

The goal of interlayer link prediction is to leverage feature or structure information to determine whether accounts across different SMNs belong to the same user [15]; this is a challenging task in multiplex network analysis. It is also known as anchor link prediction [16, 17], network alignment [18, 19, 20, 21, 22], user identification [23], and user identity linkage [21].

As hacking attempts have become more frequent, online users' security awareness has gradually increased. An increasing number of users create accounts under different usernames, hide profile information, or even provide fake content on profile configuration pages [24]. In this way, users access SMNs anonymously to make friends, share information, and discuss problems, thereby not only using multiple SMNs simultaneously but also protecting their privacy. However, such anonymity can pose a certain degree of harm to society. Cybercriminals register a large number of accounts on multiple SMNs and subsequently engage in various types of illegal activities. For example, they might circulate messages containing untruths, spread malware program links, or initiate financial fraud on these SMNs [25]. Predicting the interlayer links of the multiplex network comprising different SMNs can help criminal investigation authorities establish cybercriminals' patterns of law violations, model their online behaviors, identify their regions of activity, or even determine their real-world identities, thereby effectively fighting cybercrime.

Figure 1: Example of interlayer link prediction based on multiple types of consistency between embedding vectors. (a) Real scenario: There are two SMNs, each of which has six user accounts. The accounts linked by a vertical gray line belong to the same user. The correspondence of the account pairs linked by a solid gray line is known in advance; the task is to determine the correspondence of the other accounts. (b) Multiplex network: The multiple SMNs are represented by a multiplex network. The task of determining whether accounts on different SMNs belong to the same user becomes the task of predicting the unobserved interlayer links in the multiplex network. (c) Latent representation spaces: Network embedding techniques are used to address the interlayer link prediction problem. Each layer of the multiplex network is embedded into a latent representation space. (d) Multiple types of consistency: The degree of match used to estimate whether an interlayer link exists between two unmatched nodes across different layers consists of two parts. One is the degree of vector consistency (upper part), and the other is the degree of distance consistency (lower part).

There are other benefits as well to predicting the interlayer links in a multiplex network. For instance, because information and rumors typically spread across multiple SMNs, predicting interlayer links can help improve the understanding of information diffusion [26]. Furthermore, the use of SMNs as evidence in trials related to issues of custody, divorce, and insurance is rising rapidly [27]. A method for identifying the interlayer links across multiple SMNs would be a powerful tool in the collection of evidence for civil and criminal investigations.

The problem of predicting interlayer links in a multiplex network is typically solved by leveraging feature or structure information accessed from the multiple SMNs. Currently, there are three main approaches for handling this problem: (i) feature-based prediction, (ii) network-based prediction, and (iii) a combination of multiple approaches. Of these, network-based methods have attracted more attention; they show increasing promise because few people share the same circle of friends [28], and information on connections in SMNs is quite easy to obtain [29].

In recent years, network embedding techniques have been utilized to learn latent, low-dimensional representations of network nodes while preserving network structure.
After all of the nodes are represented as low-dimensional vectors in the latent representation space, advanced network analytic tasks such as node classification, community detection, and link prediction can be efficiently carried out [30]. Motivated by the advances in network embedding techniques for single-network tasks, researchers have proposed several strategies to leverage these techniques for solving the interlayer link prediction problem [31, 32, 33]. Typically in these studies, network embedding techniques are used first to learn the latent representations of nodes in different layers of the multiplex network. After that, a priori interlayer node pairs are constrained to have the same latent representations to unify nodes into a common latent representation space. Finally, the unobserved interlayer links are predicted by comparing the embedding vectors of unmatched nodes across different layers in the common latent space.

To unify all nodes into a common latent space and predict the unobserved interlayer links, most embedding-based methods utilize a priori interlayer links to train an approximate mapping function after achieving the latent representations, as was done in Refs. [34, 31, 33, 35]. However, the perfect mapping function is difficult to obtain, as each layer's latent space is unknown to the others [33], and this leads to unsatisfactory performance, especially when the number of training iterations is small. Apart from predicting the unobserved interlayer links by comparing the embedding vectors in the common latent space unified by the mapping function, the positional relationship between unmatched nodes and their common matched neighbors (CMNs) can also be used to measure whether an interlayer link exists between two unmatched nodes that lie in different layers. In other words, the effective information in the latent representation spaces is not fully utilized.

Figure 1 illustrates the general components of interlayer link prediction based on multiple types of consistency between embedding vectors. As shown in the top half of Fig. 1(d), if we use only the information in the common latent space unified by the mapping function, the node v^α will be matched with v^β, as their embedding vectors are more consistent. However, if we analyze the positional relationship between unmatched nodes and their CMNs simultaneously, we might uncover more clues for predicting the unobserved interlayer links. We propose a "distance consistency" index to measure this relationship. As shown in the lower half of Fig. 1(d), three aspects are considered in the distance consistency index: (i) the Euclidean distance between the unmatched node and its matched neighbor, (ii) the difference between two Euclidean distances formed by unmatched nodes across different layers and their CMNs, and (iii) the number of CMNs.

When each layer of the multiplex network is embedded into a latent representation space, different layers are often modeled as unweighted graphs, and the strength of the relationships between nodes is often ignored. However, the intralayer links between nodes may have different relationship strengths. For example, if a boy has only one friend, the friendship between him and his friend is highly likely to be closer than one between individuals who have many friends. To distinguish among relationship strengths, the intralayer links between nodes should be weighted.
To address this problem, we propose a weighted-embedding method to embed each layer of the multiplex network in the form of weighted graphs based solely on the network structure.

In this study, we developed a framework based on multiple types of consistency between embedding vectors (MulCEV) for interlayer link prediction in a multiplex network; it focuses on making full use of the information in the latent representation spaces for prediction and on modeling different layers as weighted graphs to obtain better representation. The main contributions can be summarized as follows:

• We propose a distance consistency index that is based on the positions of nodes in each latent representation space, which leverages the CMNs of the unmatched nodes across different layers as references to provide additional clues for predicting interlayer links. The degree of match used to estimate whether an interlayer link exists between two unmatched nodes across different layers consists of two parts: the degree of vector consistency, which applies the traditional embedding-based method to measure the consistency of the embedding vectors of the unmatched nodes in the common latent representation space, and the degree of distance consistency proposed above. The effective information in the latent representation spaces is fully utilized by associating these two types of consistency between embedding vectors.

• We model each layer of the multiplex network as a weighted graph to obtain better representation based solely on the network structure. Thus, the higher the strength of the relationship between nodes, the closer their embedding vectors are in the latent representation space.

• In order to reduce the time complexity, we adopt the technique of matrix multiplication to optimize the process of calculating the distance consistency and vector consistency for all unmatched node pairs.

• We test the effectiveness of the proposed MulCEV framework on two widely used real-world multiplex networks and report the results against those of state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 systematically reviews related work. Section 3 presents preliminaries and problem definitions. Section 4 describes the MulCEV framework in detail. Section 5 presents details of the datasets used in the experiments, comparison methods, and the experimental results. Finally, Section 6 concludes the paper and proposes areas for future research.
2. Related Work
The problem of interlayer link prediction in the multiplex network is typically solved by leveraging feature or structure information accessed from the multiple SMNs [3]. Early studies focused on feature information; they analyzed user profiles, location trajectories, and user-generated content to link nodes across different SMNs belonging to the same user. Profile features included username, image, position, birthday, job, and experience, among others [36]. The authors of Refs. [37, 38, 39, 40, 41] explored ways of using usernames for prediction. References [42, 43, 44, 45, 46, 34] considered various profile attributes to improve prediction performance. With the rapid development of SMNs, many users began using different usernames in different SMNs for security reasons. Meanwhile, the accessible profile information among SMNs became increasingly fragmented, unavailable, and disruptive [29]. These SMN characteristics marginalized the traditional profile-based resolutions. The trajectory-based method has been popular since the emergence of the mobile-phone-based Internet. SMN users who wish to announce their location to their friends on some SMN applications can tap a "check-in" button to see a list of nearby places and choose the place that matches their location. References [47, 48, 49] focused on these check-in data and used them to link identities. Such trajectory-based methods, however, often face data sparsity problems, and users usually share different locations on different SMNs. User-generated content can reveal some unique characteristics of an SMN user, such as his or her writing style [50, 51] or footprint [52]. These methods rely heavily on the availability of excellent natural language processing (NLP) techniques and text preprocessing algorithms because user-generated content often includes spoken words, emoticons, and abbreviations.

Data on a network's structure are highly accessible and difficult to counterfeit. In addition, a user's friend circle is highly personalized; i.e., few people share the same friend circle [28]. Therefore, network-based methods are an ideal solution for the interlayer link prediction problem and have attracted the interest of an increasing number of researchers in recent years. Network-based methods can be divided into non-embedding-based methods and embedding-based methods according to whether network embedding techniques are used.
2.1. Non-embedding-based Methods

Given the completeness and connectivity of a network structure, two kinds of structural information can be used to solve the interlayer link prediction problem. The first is local network information, which focuses on the one-hop neighborhood (e.g., follower/followee/friend relationships) of the unmatched nodes [3]. Narayanan and Shmatikov proposed a re-identification algorithm, which was the first method to use a graph-theoretic model based on the node neighborhood to solve this problem [53]. Later, Korula et al. [54] computed a similarity score for an unmatched node pair by counting the number of CMNs and then keeping all the links above a specific threshold. To avoid the possible problem of mismatching low-degree nodes in the early phases, only nodes whose degree is higher than a specified threshold are allowed to be matched. Zhou et al. [28] proposed a friend-relationship-based user identification (FRUI) algorithm that counts the number of shared friends to calculate the degree of match for all candidate-matched node pairs and chooses pairs that have the maximum value as the final set of matched pairs. Tang et al. [15] further investigated the importance of the scale-free property of real-world SMNs for accomplishing interlayer link prediction and proposed a degree penalty principle to calculate the degree of match of all unmatched node pairs. Ren et al. [55] defined a set of meta-diagrams for feature extraction and used greedy link selection for the interlayer link prediction.

The second type of structural information is global network information. Zhu et al. [56] transformed the interlayer link prediction problem into a maximum common subgraph problem and maximized the number of intralayer links to obtain a cross-layer mapping. Zafarani and Liu [23] also explored a solution utilizing global network information. They calculated the Laplacian matrices for each layer and used a matrix optimization method to perform the prediction.

The above methods were used mainly for dealing with the two-layer case, although most researchers suggested that their approaches could be extended to a three- or four-layer multiplex network. Other researchers have studied this case. The authors of Ref. [57] considered both local and global consistency to match nodes across more than two layers: local consistency for matching nodes across just two layers, and global consistency for dealing with the cases involving more than two layers. In Ref. [58], an algorithm is proposed to resolve the one-to-one constraint in the situation of crossing multiple layers. This algorithm matches nodes by minimizing the friendship inconsistency and selects the node pairs that lead to the maximum confidence scores for the multiplex network.

Other scholars have studied the interlayer link prediction problem from a different perspective. Zhang et al. [59] proposed a joint link fusion algorithm to predict the intralayer links and interlayer links simultaneously. In Refs. [60, 61], the authors studied ways of predicting interlayer links in the absence of a priori interlayer links.

2.2. Embedding-based Methods
Network embedding techniques aim to represent nodes in a network by low-dimensional vectors in a latent representation space so that advanced analytic tasks, such as node classification, community detection, and link prediction, can be conducted more efficiently in both time and space [62]. DeepWalk [63] leverages uniform random walks to generate a set of node sequences that are similar to the word sequences in natural language and uses the skip-gram model to learn vertex representations of nodes. Node2vec [64] demonstrated that DeepWalk is not sufficiently expressive to capture a more global structure and incorporated a biased random walk strategy to improve it. Tang et al. [65] proposed a large-scale information network embedding (LINE) approach to preserve both the first- and second-order proximities between nodes. Subsequently, Wang et al. [66] preserved not only the first- and second-order proximity of vertices but also the community structure. There have been numerous other studies using network embedding techniques, such as dynamic network embedding [67, 68] and embedding for scale-free networks [69].

Increasingly, computer and network scientists are exploring ways of employing network embedding techniques to improve their ability to predict interlayer links in terms of accuracy, applicability, and efficiency. Tan et al. [70] tried to map accounts across SMNs based on network embedding, adopting a hypergraph to model high-order relations of SMNs and representing nodes in a common latent space. Their method infers correspondence by comparing distances between the vectors of the unmatched nodes. Liu et al. [32] represented the multiple SMNs with a shared latent space and determined the interlayer links by computing the cosine similarity between the latent space vector of one node in layer α and another in layer β. The network embedding process was integrated with the entity alignment process under a unified optimization framework. They further refined their proposed method in Ref. [71] by incorporating structural diversity, which focuses on the impact of whether the a priori matched nodes come from different communities.

Instead of embedding all layers into a common latent space, Man et al. [31] projected each SMN into a unique latent space and represented nodes by low-dimensional vectors in the latent space. Then, they trained a cross-layer mapping function for predicting interlayer links. Zhou et al. [33] adopted the same idea and proposed a semi-supervised approach that leverages dual learning to pretrain the mapping function to improve prediction accuracy. They focused on learning latent semantics of both the node representation and the network structure [22]. Zhou et al. [72] studied the scenario without a priori interlayer links and proposed an unsupervised approach for the prediction. Considering time complexity, Wang et al. [35] proposed a framework that directly learns a binary hash code for each node across SMNs, which obtained high time efficiency while maintaining competitive prediction accuracy. In Ref. [17], the authors adopted active learning to reduce the cost of labeling a priori node pairs. In Ref. [20], the authors viewed all of the nodes in one layer as a whole and executed the prediction at the distribution level. Chu et al. [19] considered multi-layer scenarios in which the number of layers is more than two. They refined two types of low-dimensional vectors for each node: an inter-vector for interlayer link prediction, and an intra-vector for downstream network analysis tasks.

In some studies, structural information and attribute information were embedded simultaneously to perform the interlayer link prediction. Wang et al. [73] proposed a linked heterogeneous network embedding (LHNE) method to fuse the content and structural information of a user into a unified latent representation space to identify account linkages. In Ref. [74], the authors proposed a semi-supervised network embedding method to learn the attribute information and structural information simultaneously. Heterogeneous SMNs differ substantially in several aspects, including network structure, user behavior, and user information. TransLink [21] captures the heterogeneities of SMNs and embeds both nodes and their various types of interactions into a unified latent space.
3. Preliminaries and Problem Statement
3.1. Preliminaries

Table 1 displays the main symbols and notation used in this paper. We follow the common symbolic conventions, wherein bold uppercase letters denote matrices, bold lowercase letters denote column vectors, and lowercase letters denote scalars.
In general, an SMN can be represented as a graph $G(V, E)$, where $V$ is a node set representing all the accounts in the SMN, and $E \subseteq V \times V$ is an edge set representing the relationships among the accounts. Multiple SMNs can constitute a multiplex network.

Table 1: Symbols and notations

Symbol                      Description
$M$                         The multiplex network.
$G$                         An SMN, which is one layer of $M$.
$u, v$                      Nodes in $M$.
$\mathbf{u}, \mathbf{v}$    Embedding vectors of nodes $u$ and $v$, respectively.
$\alpha, \beta$             Layer indices of $M$.
$e^\alpha, e^\beta$         Intralayer links in $G^\alpha$ and $G^\beta$, respectively.
$\mathbf{e}, \mathbf{E}$    Intralayer link vector and intralayer link matrix, respectively.
$e^{\alpha\beta}$           Interlayer link.
$i, j, a, b$                Node indices.
$n^\alpha, n^\beta$         Numbers of nodes in $G^\alpha$ and $G^\beta$.
$n, m$                      Numbers of a priori interlayer links and unobserved interlayer links, respectively.
$\Gamma(v_i)$               Set of neighbors of node $v_i$.
$k_v$                       Degree of node $v$.
$p, \mathbf{P}$             Degree of vector consistency and degree-of-vector-consistency matrix, respectively.
$q, \mathbf{Q}$             Degree of distance consistency and degree-of-distance-consistency matrix, respectively.
$r, \mathbf{R}$             Degree of match and degree-of-match matrix, respectively.
$d$                         Dimensionality of the latent representation space.
$\phi$                      Mapping function.
$w$                         Weight of an intralayer link.
$\delta$                    Control parameter.
$\Phi$                      Set of a priori interlayer links.
$\Psi$                      Set of unobserved interlayer links.

Definition 1: Multiplex network. Given a set of SMNs, we can denote them using superscripts $\alpha, \beta, \ldots$, such as $G^\alpha(V^\alpha, E^\alpha), G^\beta(V^\beta, E^\beta), \ldots$. These multiple SMNs can be seen as a pair $M = (g, c)$, where $g = \{G^\alpha \mid \alpha \in \{1, \ldots, m\}\}$ is a family of graphs denoting the different SMNs and

$c = \{E^{\alpha\beta} \subseteq V^\alpha \times V^\beta \mid \alpha, \beta \in \{1, \ldots, m\}, \alpha \neq \beta\}$    (1)

is the set of interconnections between the nodes of $G^\alpha$ and $G^\beta$, where $\alpha \neq \beta$. Each element in $g$ is referred to as a layer in multiplex network $M$. The elements of $E^\alpha$ are referred to as intralayer links, and the elements of $E^{\alpha\beta}$ are called interlayer links. The interlayer links are also called interlayer node pairs, and the nodes belonging to these pairs can be called interlayer nodes. The interlayer links that are given in advance are called a priori interlayer links or a priori interlayer node pairs, and the other interlayer links are called unobserved interlayer links. The nodes belonging to the a priori interlayer node pairs are called matched nodes, and the other nodes are called unmatched nodes. A node pair consisting of two unmatched nodes across different layers can be called an unmatched node pair.

Definition 2: Common matched neighbor (CMN).
Given an a priori interlayer node pair $(v_i^\alpha, v_j^\beta)$, a node $v_a^\alpha$ in layer $\alpha$, and a node $v_b^\beta$ in layer $\beta$, if an intralayer link exists between $v_a^\alpha$ and $v_i^\alpha$ and another intralayer link exists between $v_b^\beta$ and $v_j^\beta$, we can say that the a priori interlayer node pair $(v_i^\alpha, v_j^\beta)$ is the CMN of nodes $v_a^\alpha$ and $v_b^\beta$.
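As a concrete illustration of Definition 2, the sketch below collects the CMNs of an unmatched pair. It assumes networkx-style graphs and a list of a priori anchor pairs; all names are illustrative rather than taken from the paper.

```python
def common_matched_neighbors(G_alpha, G_beta, anchors, u_a, u_b):
    """CMNs of the unmatched pair (u_a, u_b) per Definition 2: an a priori
    pair (v_i, v_j) qualifies when v_i is an intralayer neighbor of u_a in
    layer alpha and v_j is an intralayer neighbor of u_b in layer beta."""
    return [(v_i, v_j) for v_i, v_j in anchors
            if G_alpha.has_edge(u_a, v_i) and G_beta.has_edge(u_b, v_j)]
```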
Definition 3: Network embedding model. Given a layer $G^\alpha(V^\alpha, E^\alpha)$ of the multiplex network $M$, a network embedding model learns to represent each node $v_i^\alpha \in V^\alpha$ as a low-dimensional vector $\mathbf{v}_i^\alpha \in \mathbb{R}^d$, where $d$ represents the dimensionality of the latent representation space.

Definition 4: Mapping function.
In the method proposed in this paper, each layer is embedded into a single latent representation space. Given a set of interlayer links, the function $\phi$ is defined as a mapping from layer $\alpha$ to layer $\beta$ such that for each interlayer node pair $(v_i^\alpha, v_j^\beta)$, we have $\phi(\mathbf{v}_i^\alpha) = \mathbf{v}_j^\beta$. Generally, the perfect mapping function is hard to obtain, as each layer's latent space is unknown to the others [33]. Most embedding-based methods utilize a priori interlayer links to train an approximate mapping function after achieving the latent representations. After obtaining the approximate mapping function, the unmatched nodes can be unified in a common latent space by this function.

Figure 2: Overview of MulCEV. The framework comprises (i) cross-layer extension, (ii) network embedding, (iii) calculation of the degree of match, and (iv) prediction.

3.2. Problem Statement

Supposing that we have a multiplex network $M$ with a set of a priori interlayer links, the interlayer link prediction problem is to determine whether any unmatched node pair $(v_i^\alpha, v_j^\beta)$ chosen from $V^\alpha$ and $V^\beta$ has an interlayer link, i.e., whether the accounts represented by the two unmatched nodes belong to the same person.

Given an unmatched node pair $(u_i^\alpha, u_j^\beta)$ across different layers in the multiplex network $M$, interlayer link prediction learns a binary function $f : V^\alpha \times V^\beta \to \{0, 1\}$,

$f(u_i^\alpha, u_j^\beta) = \begin{cases} 1, & \text{if } e_{ij}^{\alpha\beta} \text{ exists} \\ 0, & \text{otherwise} \end{cases}$    (2)

where $f(u_i^\alpha, u_j^\beta) = 1$ means that there is an interlayer link between unmatched nodes $u_i^\alpha$ and $u_j^\beta$.

It is worth noting that some people may register two or more accounts in a given SMN. For simplicity, we assume that these accounts belong to different individuals.
4. Proposed Framework
The proposed framework (shown in Fig. 2) is an algorithm consisting of four main components: (i) cross-layer extension, (ii) network embedding, (iii) calculation of the degree of match, and (iv) prediction. We discuss each one in detail in the following sections.
4.1. Cross-layer Extension

Given two nodes with an interlayer link in the multiplex network, it is usually true that they have an intralayer link in one layer if there exists a connection in the other layer [75]. The cross-layer extension leverages a priori interlayer links to extend the intralayer links in each layer of the multiplex network, as shown in Fig. 2(i).

Given a multiplex network $M$ with two layers $G^\alpha$ and $G^\beta$, a priori interlayer link set $\Phi$, and intralayer link sets $E^\alpha$ and $E^\beta$, the extended network $\widetilde{G}^\alpha$ of layer $\alpha$ can be described as

$\widetilde{E}^\alpha = E^\alpha \cup \{(v_i^\alpha, v_j^\alpha) \mid (v_i^\alpha, v_a^\beta) \in \Phi, (v_j^\alpha, v_b^\beta) \in \Phi, (v_a^\beta, v_b^\beta) \in E^\beta\}$,    (3)

referring to Ref. [31]. The extended network $\widetilde{G}^\beta$ of layer $\beta$ is obtained in the same way. Note that it is not essential to perform cross-network extension.

4.2. Network Embedding

During network embedding, nodes that are "close" to each other in the network are embedded in such a way that they have similar vector representations [76]. How is it determined whether two nodes are "close"? Various scholars have proposed different methods. DeepWalk [63] leverages a truncated random walk to generate a set of node sequences for learning the representations; this method considers nodes with intralayer links to be close. LINE [65] uses the notion of first- and second-order proximities to measure closeness, where first-order proximity refers to intralayer links and second-order proximity refers to two nodes having common neighbors. These embedding methods often model different layers as unweighted graphs. However, different intralayer node pairs may have different relationship strengths. For example, if a boy has only one friend, the friendship between him and his companion is highly likely to be closer than that between those who have many friends. To distinguish between relationship strengths, the intralayer connection between nodes should be weighted. We propose a weighted-embedding method that embeds each layer of the multiplex network in the form of weighted graphs based purely on the network structure.

The strength of the relationship between two nodes in the same layer can usually be characterized by the number of neighbors they have in common. However, the number of intralayer links a node has also affects the strength of its relationship with other nodes, and the degrees of their common neighbors may also affect relationship strength. Considering these three factors, we propose the following formula:

$w_{ij} = \Big( \Big( \sum_{z \in \Gamma(v_i) \cap \Gamma(v_j)} \frac{1}{k_z} \Big) \cdot \frac{|\Gamma(v_i) \cap \Gamma(v_j)|}{|\Gamma(v_i) \cup \Gamma(v_j)|} + 1 \Big) \cdot e_{ij}$,    (4)

where $k_z$ is the degree of node $z$, and $\Gamma(\cdot)$ represents the neighbor set of the node inside it.

Using the above formula has the following three advantages: (i) the greater the number of common neighbors between two nodes, the greater their weight; (ii) when two pairs of nodes have the same number of common neighbors, the node pair with fewer intralayer links will have the higher weight; and (iii) the smaller the degree of the common neighbors between two nodes, the greater their weight.

After obtaining the weight of each intralayer link, we reference the network embedding model LINE [65] to update the node representation.
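The two steps above can be sketched as follows, assuming networkx graphs and a dict mapping each matched α-node to its β-counterpart. The $\sum 1/k_z$ reading of the first factor in Eq. (4) follows advantage (iii) above, and all function names are illustrative.

```python
def extend_layer(G_alpha, G_beta, anchors):
    """Cross-layer extension (Eq. 3): add an intralayer link between two
    matched alpha nodes whenever their counterparts are linked in beta."""
    G_ext = G_alpha.copy()
    pairs = list(anchors.items())          # (v_alpha, v_beta) anchor pairs
    for vi, va in pairs:
        for vj, vb in pairs:
            if vi != vj and G_beta.has_edge(va, vb):
                G_ext.add_edge(vi, vj)
    return G_ext

def edge_weight(G, i, j):
    """Relationship strength of an existing intralayer link (i, j), Eq. (4)."""
    Ni, Nj = set(G[i]), set(G[j])
    common, union = Ni & Nj, Ni | Nj
    inv_deg = sum(1.0 / G.degree(z) for z in common)  # low-degree common neighbors count more
    return inv_deg * len(common) / len(union) + 1.0   # +1 keeps every existing link positive
```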
Forany intralayer link e αij = ( v αi , v αj ) in a given layer α , the joint probabilitybetween node v αi and v αj is z ( v αi , v αj ) = 11 + exp( − ( v αi ) T · v αj ) , (5)where v αi and v αj are the low-dimensional vectors of nodes v αi and v αj , respec-tively, which are defined in R d ; z ( · , · ) is a distribution over the space V α × V α ,and ( · ) T is the transposition function. The empirical counterpart of z ( · , · )can be defined as (cid:98) z ( · , · ) = w αij /W , where w αij is the weight of the intralayerlink e αij as calculated by Eq. (4), and W is the summation of the weights ofall intralayer links. By minimizing the KL-divergence [77] of z ( · , · ) and itsempirical counterpart (cid:98) z ( · , · ) over all the intralayer links in the α layer, theLINE model can be inferred. The objective function for embedding is O = − (cid:88) ∀ ( u αi ,u αj ) ∈ E α KL( (cid:98) z ( u αi , u αj ) , z ( u αi , u αj )) , (6)where the KL-divergence KL ( · , · ) is a method of measuring the similarityof two distributions. By omitting some constants, the objective function forembedding can be rewritten as O = − (cid:88) ∀ ( u αi ,u αj ) ∈ E α w αij log( z ( u αi , u αj )) . (7)By minimizing Eq. (7) over all the intralayer links independently, each of thenodes in the given layer α can be represented as a d -dimensional vector in thelatent representation space with the stochastic gradient descent algorithm.The layer β of the multiplex network can be embedded by following the samesteps. For any two unmatched nodes across different layers, we calculate a scoreaccording to the MulCEV framework to estimate whether an interlayer linkexists between them. We call this score the degree of match, which consistsof two parts: the degree of vector consistency, and the degree of distanceconsistency. The details are as follows.16 … …… …
4.3. Calculation of the Degree of Match

For any two unmatched nodes across different layers, we calculate a score according to the MulCEV framework to estimate whether an interlayer link exists between them. We call this score the degree of match, which consists of two parts: the degree of vector consistency and the degree of distance consistency. The details are as follows.
Figure 3: Structure of the multi-layer perceptron (MLP). We used up to three hidden layers and up to 1200 neurons per layer. The numbers of neurons in the input layer and output layer depend on the dimensionality of the latent representation space.
We leverage a feed-forward multi-layer perceptron (MLP) [78] to learn the mapping function from one layer to another based on the a priori interlayer links. The structure of the MLP used in MulCEV is shown in Fig. 3. Given each of the a priori interlayer node pairs $(v_i^\alpha, v_j^\beta) \in E^{\alpha\beta}$ and their corresponding embedding vectors $(\mathbf{v}_i^\alpha, \mathbf{v}_j^\beta)$, we use $\mathbf{v}_i^\alpha$ as the input and $\mathbf{v}_j^\beta$ as the target output to train the mapping function $\phi$. The loss function of the MLP is

$l(v_i^\alpha, v_j^\beta) = 1 - \cos(\phi(\mathbf{v}_i^\alpha), \mathbf{v}_j^\beta)$,    (8)

where $\cos(\cdot, \cdot)$ is the cosine similarity, and $\phi(\mathbf{v}_i^\alpha)$ is the actual output mapped by the MLP. The value of the loss function ranges from 0 to 2. Suppose that we have $n$ a priori interlayer links; then for all a priori interlayer nodes, we use $\mathbf{A}^\alpha \in \mathbb{R}^{d \times n}$ and $\mathbf{A}^\beta \in \mathbb{R}^{d \times n}$ to represent their respective embedding vector matrices. The goal of training the MLP is to minimize the following cost function:

$L(\mathbf{A}^\alpha, \mathbf{A}^\beta) = 1 - \cos(\phi(\mathbf{A}^\alpha), \mathbf{A}^\beta; \Theta)$,    (9)

where $\Theta$ is the collection of all parameters in the mapping function $\phi$.

To obtain the degree of vector consistency, for any given unmatched node pair $(u_a^\alpha, u_b^\beta)$ with embedding vectors $\mathbf{u}_a^\alpha$ and $\mathbf{u}_b^\beta$, we map node $u_a^\alpha$ into the latent representation space of layer $\beta$ according to the mapping function, obtaining $\phi(\mathbf{u}_a^\alpha)$. We then use cosine similarity to compute the degree of vector consistency between $\phi(\mathbf{u}_a^\alpha)$ and $\mathbf{u}_b^\beta$:

$p(u_a^\alpha, u_b^\beta) = \frac{\phi(\mathbf{u}_a^\alpha)^T \cdot \mathbf{u}_b^\beta}{\|\phi(\mathbf{u}_a^\alpha)\| \cdot \|\mathbf{u}_b^\beta\|}$,    (10)

where $\|\cdot\|$ represents the 2-norm of the vector within.
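A minimal sketch of this training procedure, assuming PyTorch; the single hidden layer of 1200 units mirrors the upper bound reported in Fig. 3, and the other hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_mapping(A_alpha, A_beta, d, hidden=1200, epochs=2000, lr=1e-3):
    """Learn phi: R^d -> R^d from the anchor matrices (Eqs. 8-9).
    A_alpha, A_beta: (n, d) tensors of matched embedding vectors."""
    phi = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # cosine loss of Eq. (8), averaged over all anchor pairs
        loss = (1.0 - F.cosine_similarity(phi(A_alpha), A_beta, dim=1)).mean()
        loss.backward()
        opt.step()
    return phi
```

The degree of vector consistency of Eq. (10) for an unmatched pair is then simply `F.cosine_similarity(phi(u_a), u_b, dim=0)`.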
As shown in Fig. 1(d), if we consider only the degree of vector consistency, it may be difficult to obtain good prediction results in some cases, such as the incorrect match shown in the figure. The reason is that a perfect mapping function is difficult to obtain [33]. If we consider the positional relationship between the unmatched nodes and their matched neighbors in the embedding spaces of different layers, we might uncover additional clues for predicting the unobserved interlayer links. We propose a "distance consistency" index to measure this relationship, defined as

$q(u_a^\alpha, u_b^\beta) = \sum_{\forall (v_i^\alpha, v_j^\beta) \in \Phi,\, v_i^\alpha \in \Gamma(u_a^\alpha),\, v_j^\beta \in \Gamma(u_b^\beta)} \exp(-(s_{ai}^\alpha \cdot |s_{ai}^\alpha - s_{bj}^\beta| \cdot s_{bj}^\beta))$.    (11)

In Eq. (11), $\Phi$ represents the set of a priori interlayer links, and $s_{ai}^\alpha$ is the Euclidean distance between unmatched node $u_a^\alpha$ and matched node $v_i^\alpha$. The constraints in the equation indicate that the interlayer node pair $(v_i^\alpha, v_j^\beta)$ is a CMN of unmatched nodes $u_a^\alpha$ and $u_b^\beta$. Suppose that matched node pair $(v_i^\alpha, v_j^\beta)$ is the CMN of the unmatched nodes $u_a^\alpha$ and $u_b^\beta$; then $|s_{ai}^\alpha - s_{bj}^\beta|$ can measure the degree of similarity between $s_{ai}^\alpha$ and $s_{bj}^\beta$. If the value of $|s_{ai}^\alpha - s_{bj}^\beta|$ is close to 0, the Euclidean distances $s_{ai}^\alpha$ and $s_{bj}^\beta$ are deemed to be consistent; otherwise, they are deemed to be inconsistent. Multiplying $|s_{ai}^\alpha - s_{bj}^\beta|$ by $s_{ai}^\alpha$ and $s_{bj}^\beta$ distinguishes the influence of the CMNs on the degree of distance consistency: the closer the Euclidean distance of an unmatched node and its CMN, the greater the influence of this CMN. The product $(s_{ai}^\alpha \cdot |s_{ai}^\alpha - s_{bj}^\beta| \cdot s_{bj}^\beta)$ can be transformed by the sigmoid function to ensure that its value is between 0 and 1. In addition, for an unmatched node pair, the larger the value of $|s_{ai}^\alpha - s_{bj}^\beta|$, the smaller the distance consistency should be, which is reflected by the exponential function $\exp(-(\cdot))$ in the formula.

In summary, the degree of distance consistency has the following characteristics: (i) the greater the number of CMNs, the greater the degree of distance consistency; (ii) the smaller the Euclidean distance between an unmatched node and its CMN, the greater the influence of this CMN on the degree of distance consistency; and (iii) the smaller the difference between the two Euclidean distances formed by unmatched nodes across different layers and their CMNs, the larger the degree of distance consistency.

After obtaining the degrees of vector consistency and distance consistency for unmatched node pair $(u_a^\alpha, u_b^\beta)$, we associate these two types of consistency to calculate the final degree of match:

$r(u_a^\alpha, u_b^\beta) = \delta \cdot p(u_a^\alpha, u_b^\beta) + (1 - \delta) \cdot q(u_a^\alpha, u_b^\beta)$,    (12)

where $\delta$ is a control parameter that takes a value from 0 to 1. If $\delta = 0$, the degree of match is related only to the distance consistency, whereas if $\delta = 1$, the degree of match is related only to the vector consistency.

For any node $u_a^\alpha$ in layer $\alpha$, we can calculate its degree of match with all unmatched nodes in layer $\beta$. We can then predict an interlayer link by identifying the counterpart node in layer $\beta$ that has the highest degree of match with node $u_a^\alpha$, or offer a list of nodes in layer $\beta$ as potential counterparts of node $u_a^\alpha$.

To reduce the time complexity, we optimized the calculation of the degrees of vector consistency and distance consistency.
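The per-pair computation of Eqs. (11) and (12) can be sketched directly, assuming numpy arrays of embedding vectors and the CMN list from Definition 2; the default value of delta here is purely illustrative.

```python
import numpy as np

def distance_consistency(u_a, u_b, cmn_pairs, emb_a, emb_b):
    """Degree of distance consistency q (Eq. 11). `cmn_pairs` lists the
    CMNs of the unmatched pair as (i, j) anchor index pairs; emb_a/emb_b
    hold the embedding vectors of each layer's nodes."""
    q = 0.0
    for i, j in cmn_pairs:
        s_ai = np.linalg.norm(u_a - emb_a[i])   # distance to the matched neighbor in alpha
        s_bj = np.linalg.norm(u_b - emb_b[j])   # and in beta
        q += np.exp(-(s_ai * abs(s_ai - s_bj) * s_bj))
    return q

def degree_of_match(p, q, delta=0.5):
    """Final degree of match r (Eq. 12); delta balances the two terms."""
    return delta * p + (1.0 - delta) * q
```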
The degree of vector consistency for each unmatched node pair can be calculated using Eq. (10). However, many calculations are repeated, such as that of $\|\mathbf{u}_b^\beta\|$. We propose an approach based on matrix operations to reduce the computational time complexity.

For all unmatched nodes, denote $\mathbf{B}^\alpha = [\mathbf{u}_1^\alpha, \mathbf{u}_2^\alpha, \ldots, \mathbf{u}_{n^\alpha - n}^\alpha]$, $\mathbf{B}^\beta = [\mathbf{u}_1^\beta, \mathbf{u}_2^\beta, \ldots, \mathbf{u}_{n^\beta - n}^\beta]$, $\phi(\mathbf{B}^\alpha) = [\phi(\mathbf{u}_1^\alpha), \phi(\mathbf{u}_2^\alpha), \ldots, \phi(\mathbf{u}_{n^\alpha - n}^\alpha)]$, $\mathbf{b}^\alpha = [\|\mathbf{u}_1^\alpha\|, \|\mathbf{u}_2^\alpha\|, \ldots, \|\mathbf{u}_{n^\alpha - n}^\alpha\|]^T$, $\mathbf{b}^\beta = [\|\mathbf{u}_1^\beta\|, \ldots, \|\mathbf{u}_{n^\beta - n}^\beta\|]^T$, and $\phi(\mathbf{b}^\alpha) = [\|\phi(\mathbf{u}_1^\alpha)\|, \ldots, \|\phi(\mathbf{u}_{n^\alpha - n}^\alpha)\|]^T$. The degree of vector consistency for all unmatched node pairs can then be expressed as

$\mathbf{P} = \frac{\phi(\mathbf{B}^\alpha)^T \cdot \mathbf{B}^\beta}{\phi(\mathbf{b}^\alpha) \cdot (\mathbf{b}^\beta)^T}$,    (13)

where the division is element-wise.

If $\exp(-(s_{ai}^\alpha \cdot |s_{ai}^\alpha - s_{bj}^\beta| \cdot s_{bj}^\beta))$ is denoted by $h_{ai\text{-}bj}$, it is clear that if interlayer node pair $(v_i^\alpha, v_j^\beta)$ is the CMN of unmatched node pair $(u_a^\alpha, u_b^\beta)$, $e_{ai}^\alpha$ and $e_{bj}^\beta$ will be equal to 1; thus, $e_{ai}^\alpha \cdot h_{ai\text{-}bj} \cdot e_{bj}^\beta = h_{ai\text{-}bj}$. In contrast, if interlayer node pair $(v_i^\alpha, v_j^\beta)$ is not the CMN of unmatched node pair $(u_a^\alpha, u_b^\beta)$, $e_{ai}^\alpha$ or $e_{bj}^\beta$ will be equal to 0; thus, $e_{ai}^\alpha \cdot h_{ai\text{-}bj} \cdot e_{bj}^\beta = 0$. Therefore, Eq. (11) can be rewritten as

$q(u_a^\alpha, u_b^\beta) = \sum_{\forall (v_i^\alpha, v_j^\beta) \in \Phi} e_{ai}^\alpha \cdot h_{ai\text{-}bj} \cdot e_{bj}^\beta$.    (14)

If node $v_i^\alpha$ is an a priori interlayer node in layer $\alpha$, a counterpart node must exist in layer $\beta$, and vice versa. Based on this, we can make the a priori interlayer nodes uniform, as follows: $(v_1^\alpha, v_1^\beta), \ldots, (v_i^\alpha, v_i^\beta), \ldots, (v_n^\alpha, v_n^\beta)$. Therefore, Eq. (14) can be replaced with

$q(u_a^\alpha, u_b^\beta) = \sum_{i=1}^n e_{ai}^\alpha \cdot h_{ai\text{-}bi} \cdot e_{bi}^\beta$.    (15)

In vector form, Eq. (15) can be written as

$q(u_a^\alpha, u_b^\beta) = [e_{a1}^\alpha, \ldots, e_{ai}^\alpha, \ldots, e_{an}^\alpha] \cdot [h_{a1\text{-}b1} e_{b1}^\beta, \ldots, h_{ai\text{-}bi} e_{bi}^\beta, \ldots, h_{an\text{-}bn} e_{bn}^\beta]^T$.    (16)

We can use the Hadamard product to rewrite $[h_{a1\text{-}b1} e_{b1}^\beta, \ldots, h_{an\text{-}bn} e_{bn}^\beta]^T$ as $[h_{a1\text{-}b1}, \ldots, h_{an\text{-}bn}]^T \circ [e_{b1}^\beta, \ldots, e_{bn}^\beta]^T$. By denoting $\mathbf{h}_{ab} = [h_{a1\text{-}b1}, \ldots, h_{an\text{-}bn}]^T$, $\mathbf{e}_a^\alpha = [e_{a1}^\alpha, \ldots, e_{an}^\alpha]^T$, and $\mathbf{e}_b^\beta = [e_{b1}^\beta, \ldots, e_{bn}^\beta]^T$, Eq. (16) can be rewritten as

$q(u_a^\alpha, u_b^\beta) = (\mathbf{e}_a^\alpha)^T \cdot (\mathbf{h}_{ab} \circ \mathbf{e}_b^\beta)$.    (17)

By using Eq. (17), the degree of distance consistency for unmatched node pair $(u_a^\alpha, u_b^\beta)$ can be represented in vector operation form. Then, to obtain the degree of distance consistency between node $u_a^\alpha$ and all the unmatched nodes in layer $\beta$, we can express Eq. (17) in matrix operation form. In a similar manner, we denote $\mathbf{H} = [\mathbf{h}_{a1}, \ldots, \mathbf{h}_{ab}, \ldots, \mathbf{h}_{a n^\beta}]$ and $\mathbf{E}^\beta = [\mathbf{e}_1^\beta, \ldots, \mathbf{e}_b^\beta, \ldots, \mathbf{e}_{n^\beta}^\beta]$. The degree of distance consistency between node $u_a^\alpha$ and all the unmatched nodes in layer $\beta$ can be calculated as

$\mathbf{q}_a^\alpha = ((\mathbf{e}_a^\alpha)^T \cdot (\mathbf{H} \circ \mathbf{E}^\beta))^T$.    (18)

By denoting $\mathbf{s}_a^\alpha = [s_{a1}^\alpha, \ldots, s_{ai}^\alpha, \ldots, s_{an}^\alpha]$, $\mathbf{s}_b^\beta = [s_{b1}^\beta, \ldots, s_{bi}^\beta, \ldots, s_{bn}^\beta]^T$, and $\mathbf{S}^\beta = [\mathbf{s}_1^\beta, \ldots, \mathbf{s}_b^\beta, \ldots, \mathbf{s}_{n^\beta}^\beta]^T$, matrix $\mathbf{H}$ can be calculated as

$\mathbf{H} = \exp\{-(\mathbf{i} \cdot \mathbf{s}_a^\alpha) \circ |\mathbf{i} \cdot \mathbf{s}_a^\alpha - \mathbf{S}^\beta| \circ \mathbf{S}^\beta\}$,    (19)

where $\mathbf{i}$ is a column vector with $n^\beta$ elements, each equal to 1.

By joining the vectors for the degree of distance consistency for all unmatched nodes in layer $\alpha$, we can obtain the matrix of the degree of distance consistency for all unmatched node pairs, which can be represented as $\mathbf{Q} = [\mathbf{q}_1^\alpha, \ldots, \mathbf{q}_a^\alpha, \ldots, \mathbf{q}_{n^\alpha}^\alpha]^T$.
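A vectorized numpy sketch of Eqs. (13) and (15)-(19). It builds the full tensor of h-values at once, trading memory for speed; the argument layout is an assumption for illustration rather than the paper's implementation.

```python
import numpy as np

def consistency_matrices(Phi_Ba, Bb, S_alpha, S_beta, E_alpha, E_beta):
    """Vectorized degrees of consistency.
    Phi_Ba : (d, Na) mapped embeddings phi(B^alpha) of unmatched alpha nodes
    Bb     : (d, Nb) embeddings of unmatched beta nodes
    S_alpha: (Na, n) Euclidean distances to the n anchor nodes in alpha
    S_beta : (Nb, n) Euclidean distances to the n anchor nodes in beta
    E_alpha: (Na, n) 0/1 adjacency of unmatched alpha nodes to anchors
    E_beta : (Nb, n) likewise in beta"""
    # vector consistency: cosine-similarity matrix (Eq. 13)
    P = (Phi_Ba.T @ Bb) / np.outer(np.linalg.norm(Phi_Ba, axis=0),
                                   np.linalg.norm(Bb, axis=0))
    # distance consistency: H[a, b, i] = exp(-(s_ai * |s_ai - s_bi| * s_bi))
    Sa = S_alpha[:, None, :]                  # (Na, 1, n)
    Sb = S_beta[None, :, :]                   # (1, Nb, n)
    H = np.exp(-(Sa * np.abs(Sa - Sb) * Sb))  # (Na, Nb, n)
    # sum only over the anchors that are CMNs of each pair (Eqs. 15/18)
    Q = np.einsum('ai,abi,bi->ab', E_alpha, H, E_beta)
    return P, Q
```

The final combination in Eq. (20) below is then a single weighted sum of P and Q.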
Finally, the matrix of the degree of match for all unmatched node pairs can be obtained by

$\mathbf{R} = \delta \cdot \mathbf{P} + (1 - \delta) \cdot \mathbf{Q}$.    (20)

The interlayer link prediction results are obtained by ranking each row or column of $\mathbf{R}$ in reverse order according to the degree of match.
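In code, this ranking step amounts to sorting each row of R, as in the following sketch (names illustrative):

```python
import numpy as np

def predict_top_n(R, N=30):
    """Return, for every unmatched alpha node (row of R), the indices of
    the N beta candidates with the highest degree of match (Eq. 20)."""
    return np.argsort(-R, axis=1)[:, :N]
```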
5. Experiments

In this section, we first describe the datasets, baselines, and evaluation metrics. We then compare the proposed framework with baseline methods on two real-world datasets.

Table 2: Statistics of the datasets. $|V|$ and $|E|$ are the numbers of nodes and intralayer links, respectively; $k_{max}$ is the maximum degree; $\langle k \rangle$ is the average degree; $r$ is the degree-degree correlation; $c$ is the clustering coefficient; $H$ is the degree heterogeneity, $H = \langle k^2 \rangle / \langle k \rangle^2$; and $|E^{\alpha\beta}|$ is the number of interlayer links.

Network                |V|     |E|      k_max  <k>    r       c     H      |E^ab|
Foursquare             5,313   76,972   552    20.42  -0.193  0.23  3.446  3,148
Twitter                5,120   164,920  1725   51.01  -0.214  0.30  4.489
DBLP DataMining        11,526  47,326   117    36.68  0.110   0.85  2.176  1,295
DBLP MachineLearning   12,311  43,948   552    20.42  -0.193  0.23  3.446
5.1. Datasets

To evaluate the performance of our proposed framework and the baseline methods, we used the following two real-world multiplex network datasets in our experiments (cf. Table 2):

• Foursquare-Twitter (FT): This dataset was collected from Foursquare and Twitter by Zhang et al. [59]. The ground truth for this dataset is provided in Foursquare's profiles, and the nodes of the two social networks are partially aligned.

• DBLP DataMining-DBLP MachineLearning (DBLP): This dataset was collected from the Citation Network Dataset [79] and processed by Liu et al. [71]. It is a co-authorship multiplex network, one layer of which consists of researchers who published articles in journals or conference proceedings related to data mining, with the other layer containing researchers who published articles in journals or conference proceedings related to machine learning. The ground truth was obtained by collecting the authors who published articles in both fields.

5.2. Baseline Methods

We used several state-of-the-art methods as baselines, as follows.

• DeepLink [33]: DeepLink is a semi-supervised learning algorithm that leverages traditional random walks to generate social sequences for the network embedding and utilizes the duality of mapping to improve the prediction performance.

• IONE [32]: Input-output network embedding (IONE) projects multiple social networks into a common embedded space and matches same-user accounts by calculating the cosine similarity between the vectors of two nodes. IONE represents each account by three vectors: a node vector, an input context vector, and an output context vector.

• ONE [32]: This method is a simplified version of IONE. In this method, an account is represented by two vectors: a node vector and an output context vector.

• IONE-D [71]: This method is a refined version of IONE that explores the community structure of the SMNs and incorporates structural diversity to characterize a set of interlayer links.

• BootEA [80]: This is a bootstrapping approach that aligns the entities of different knowledge graphs based on network embedding. It iteratively labels potential entity pairs as training data to overcome the lack of a sufficiently large training set and leverages an editing method to reduce error accumulation during the iterations.

• PALE [31]: This method projects each SMN into a unique low-dimensional space and represents nodes by low-dimensional vectors in a latent space. Then, it learns a cross-layer mapping function for predicting interlayer links.

• MAH [70]: Manifold alignment on hypergraph (MAH) tries to map common users across SMNs based on the network embedding method. It adopts a hypergraph to model high-order relations of SMNs and represents nodes in a common latent space. It infers correspondence by comparing distances between the vectors of the unmatched nodes.

• MAG [70]: Manifold alignment on traditional graphs (MAG) uses $w(u_i, u_j) = |R_{u_i} \cap R_{u_j}| / (|R_{u_i}| + |R_{u_j}|)$ to calculate node-to-node pairwise weights to build a graph for each SMN. The method for obtaining the node ranking result is the same as that for MAH.

• CRW [59]: Collective random walk (CRW) is a joint link fusion approach for predicting the intralayer links and interlayer links simultaneously; it transfers information relating to intralayer links from one layer to another.

5.3. Experiment Configuration
We employed Precision@N (P@N) [32, 3] and MAP [3] as the metrics to evaluate the performance of all methods. P@N is defined as

$P@N = \Big( \sum_{i=1}^m \mathbb{1}_i\{success@N\} \Big) / m$,    (21)

where $\mathbb{1}_i\{success@N\}$ indicates whether the correct interlayer link exists in the top-$N$ list, and $m$ represents the number of all unobserved interlayer links. It is noteworthy that Precision@N is actually the same as Recall@N and F1@N in the field of interlayer link prediction because Precision@N represents the true-positive prediction rate. When $N = 1$, P@1 equates to the metric of precision.

MAP is used to evaluate the ranked performance of different methods and is defined as

$MAP = \Big( \sum_{i=1}^m \frac{1}{r_i} \Big) / m$,    (22)

where $r_i$ represents the rank of the $i$-th unmatched interlayer link. The higher the values of P@N and MAP, the better the performance of the method.
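Both metrics are straightforward to compute from the ranked candidate lists. A sketch, assuming `rank_lists` maps each test node to its candidates sorted by degree of match and `truth` maps it to its true counterpart (hypothetical names), with Eq. (22) read as a mean reciprocal rank:

```python
import numpy as np

def precision_at_n(rank_lists, truth, N):
    """P@N (Eq. 21): fraction of test nodes whose true counterpart
    appears in the top-N candidate list."""
    hits = sum(truth[u] in ranked[:N] for u, ranked in rank_lists.items())
    return hits / len(rank_lists)

def mean_average_precision(rank_lists, truth):
    """MAP (Eq. 22): mean reciprocal rank of the true counterpart,
    assuming every candidate list contains it somewhere."""
    rr = [1.0 / (ranked.index(truth[u]) + 1) for u, ranked in rank_lists.items()]
    return float(np.mean(rr))
```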
To test the performance, the set of all interlayer links was randomly divided into two parts for each experiment: (i) a training set Φ, which was treated as the set of a priori interlayer links; and (ii) a test set Ψ, which was used for testing and can be considered a collection of the unmatched node pairs waiting for prediction. The ratio of the size of the training set to the size of the set of all interlayer links is called the training ratio, which we varied in some of the experiments. Our task was to uncover the interlayer links in the test set based on the information in the training set and each layer of the multiplex network. For all experiments, we set the control parameter δ to 0.

5.4. Results and Analysis

In this subsection, we report a comparison of the results of the baseline methods and the proposed method for different @N settings, training ratios, dimensionalities, and numbers of iterations. We also compare the performance of the proposed embedding method with two state-of-the-art embedding methods, DeepWalk [63] and node2vec [64].

Table 3: Performance of MulCEV for different δ (metric: P@30; dataset: FT; training ratio: 0.3). Values are listed in order of increasing δ:
0.4224  0.4407  0.4471  0.4546  0.4614  0.4710  0.4869  0.5036  0.5151

Figure 4: Comparison between baselines and our proposed methods for different @N settings. (a) Precision for different @N settings on the FT dataset; (b) precision for different @N settings on the DBLP dataset.

We first evaluated the performance of the baseline methods and the proposed method at different @N settings. In the proposed framework, the first step is cross-layer extension. This step, however, is not mandatory. We call the version without cross-network extension MulCEV, and the version with cross-network extension MulCEV-Ex.

Referring to Refs. [71] and [33], we set 90.0% of the interlayer links as the training set and the rest as the test set. Figure 4 displays the precision of the baseline methods and the proposed method under this setting. From the figure, we can see that MulCEV-Ex achieved the highest precision for all @N settings. On the FT dataset, the precision increased by a maximum of 10.8% and an average of 5.8% over DeepLink, the best of the baseline methods. On the DBLP dataset, the precision increased by a maximum of 7% and an average of 2.3% over BootEA. MulCEV achieved the second-highest performance, for a maximum increase of 6.8% and 5.7% on the two respective datasets compared with the best of the baseline methods. In contrast with other methods based on network embedding, our method further considers distance consistency with CMNs. The results imply that the distance consistency provides more clues and better facilitates the prediction of interlayer links. MulCEV-Ex was better than MulCEV under most settings because MulCEV-Ex leverages a priori interlayer links to extend each layer of the multiplex network. The extended layer has more edges to guide the embedding than the non-extended network, so the node positions in the embedding space can better reflect the relationships between nodes. Such advantages are highlighted in the subsequent matching process. The improvement of MulCEV-Ex over MulCEV was greater on the FT dataset than on the DBLP dataset, as the percentage of interlayer nodes is greater in the FT dataset than in the DBLP dataset. The greater the number of interlayer links, the greater the number of intralayer links that can be extended.

CRW was the lowest-precision method, showing that the traditional link-based prediction method is not as accurate as the network embedding approach. MAH and MAG showed better performance than CRW but were a bit worse than the other methods. This may be because MAH needs hypergraph information, which is often difficult to obtain for an actual SMN and thus must be built using specific methods based on the data obtained, leading to poor performance by MAH. MAG uses a formula for the calculation of node-to-node pairwise weights to build a graph for each SMN and obtains ranking results by the same method as MAH, so its performance is similar to MAH's.

With regard to IONE and its two variants ONE and IONE-D, ONE does not consider the information of the input context of the node, and its performance was not as good as that of IONE. IONE-D, although based on IONE, further considers the impact of community-based structural diversity, and so it exhibited better performance than IONE. PALE does not consider the input and output contexts of the node separately; its performance was not as good as IONE's. It is noteworthy that IONE and PALE are two classical methods based on network embedding; IONE embeds all layers into a common latent space, and PALE embeds each layer into a unique space. On the two datasets, MulCEV increased the precision by an average of 7.0% and 6.4%, and by a maximum of more than 15%, relative to these two classical methods. As N increases, the precision of the various methods also increases. This is because @N denotes the number of potential matches recommended by different methods for each unmatched node. The greater the value of N, the higher the number of candidate matches and the higher the probability of success in finding the correct match.

We also investigated the ranking performance of our suggested methods and some baselines with the 90.0% training ratio; Table 4 shows the results. The highest value for each dataset is in boldface. We can see that MulCEV-Ex outperformed all the comparison methods, and MulCEV was better than all the baseline methods. This observation further demonstrates the effectiveness and merits of the proposed framework.

Table 4: MAP of different methods. (Columns: MulCEV-Ex, MulCEV, IONE-D, BootEA, IONE, PALE, ONE; rows: FT, DBLP.)

It is worth noting that, of all the methods, MulCEV-Ex is the only one that extends the intralayer links by interlayer links. Therefore, in order to conduct the comparison of the different methods under the same conditions to the extent possible, in the subsequent experiments we excluded MulCEV-Ex and used only MulCEV for the comparisons with the baselines.
We evaluated the performance of the baselines and MulCEV under different settings for the training ratio. We set training ratios of 10% to 90% in 10% increments, with N = 30. Figure 5 displays the P@30 of the baselines and MulCEV under these settings.

From the figure, we can see that the proportion of interlayer links used for training markedly affected the performance of all of the methods: for each method, P@30 increased with the training ratio. This is because the greater the training ratio, the greater the quantity of training data. For the methods that embed each layer into a unique latent space, there are more inputs from which to learn the mapping function; for the method that embeds all layers into a common latent space, there are more inputs with which to align nodes in the common embedding space. Moreover, the rankings of the performance of all methods on the two datasets are similar to those under the various @N settings, for the same reasons as given for Fig. 4. In particular, MulCEV achieved the highest precision for almost all training ratios. On the FT dataset, P@30 increased by a maximum of 4.8% over the best baseline; on the DBLP dataset, it increased by an average of 0.4% over the best baseline, IONE-D. These observations demonstrate the effectiveness and merits of the proposed method.
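As a minimal sketch of this protocol, the known interlayer (anchor) links can be split by the training ratio as follows; the pair-list representation and the function name are assumptions for illustration, not the paper's code.

import random

def split_interlayer_links(anchor_links, train_ratio, seed=42):
    """Split the known interlayer (anchor) links into a training set,
    used to learn the mapping / align the embedding spaces, and a test
    set of withheld links used to score P@N."""
    links = list(anchor_links)
    random.Random(seed).shuffle(links)
    cut = int(len(links) * train_ratio)
    return links[:cut], links[cut:]

anchors = [(f"a{i}", f"b{i}") for i in range(100)]
train, test = split_interlayer_links(anchors, 0.3)  # 30 training links, 70 test links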
Figure 5: Comparison between baselines and MulCEV for different training ratios. (a) P@30 for different training ratios on the FT dataset; (b) P@30 for different training ratios on the DBLP dataset.

We also evaluated the performance of the network-embedding-based baseline methods and MulCEV using representations of different dimensionalities d. We set d to 16, 32, 64, 128, and 256; the training ratio was 90.0% and N = 30. Figure 6 displays the P@30 values for the baselines and MulCEV under these settings.

From Fig. 6, we can see that the rankings of the performance of all methods on the two datasets are similar to the rankings under the various @N settings, for the same reasons as given for Fig. 4. In particular, MulCEV achieved the highest precision for almost all dimensionalities. On the FT dataset, P@30 increased by a maximum of 8.2% over the best baseline; on the DBLP dataset, it increased by an average of 1.6% over the best baseline, BootEA. These observations demonstrate the effectiveness and merits of the proposed framework. Moreover, we can see that MulCEV, DeepLink, IONE-D, and IONE achieved their best performance on the FT dataset with d = 128 and on the DBLP dataset with d = 64; the other methods needed more dimensions to achieve their best performance. It is well known that the computational complexity of learning algorithms depends strongly on the dimensionality of the embedding space: the lower the dimensionality, the lower the computational complexity. These results again demonstrate the effectiveness and merits of the proposed framework.
Figure 6: Comparison between baselines and MulCEV for different dimensionalities. (a) P@30 for different embedding dimensionalities on the FT dataset; (b) P@30 for different embedding dimensionalities on the DBLP dataset.

The number of training iterations needed for a prediction method to converge is another important factor to consider in evaluating these methods.
Referring to Refs. [71, 33], we set the training ratio to 90.0% and N = 30 and executed experiments evaluating the baselines and MulCEV with different numbers of iterations. Figure 7 displays P@30 for the baselines and MulCEV under these settings.

From the figure, we can see that the rankings of the performance of all methods on the two datasets are similar to those under the various @N settings, for the same reasons as given for Fig. 4. Meanwhile, MulCEV achieved the highest precision at almost all iteration counts. In particular, it achieved competitive results at very low iteration counts: at 2000 iterations, it achieved P@30 values of 0.667 and 0.512 on the two respective datasets.
In contrast, the P@30 values of all the baselines were close to zero at that point. This is because the degree of match in MulCEV consists of two parts: the degree of vector consistency and the degree of distance consistency. The latter is calculated in advance, before the mapping function is learned, and thus provides clues for prediction from the earliest iterations; thereafter, the mapping function improves as the number of iterations increases, further improving the prediction performance.
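As a rough sketch of this two-part scoring, the precomputed distance-consistency matrix can be combined with a vector-consistency matrix obtained from the current embeddings. The cosine formulation and the weight lam below are illustrative assumptions, not the paper's exact combination rule; all candidate pairs are scored with a single matrix multiplication.

import numpy as np

def degree_of_match(mapped_alpha, emb_beta, distance_consistency, lam=0.5):
    """Score all candidate pairs at once. Vector consistency is the cosine
    similarity between the mapped alpha-layer vectors and the beta-layer
    vectors; distance consistency is precomputed from node positions
    relative to CMNs, so it is available before the mapping converges."""
    a = mapped_alpha / np.linalg.norm(mapped_alpha, axis=1, keepdims=True)
    b = emb_beta / np.linalg.norm(emb_beta, axis=1, keepdims=True)
    vector_consistency = a @ b.T  # (n_alpha, n_beta) in one matrix product
    return lam * vector_consistency + (1 - lam) * distance_consistency

# Ranked candidates for each unmatched alpha-layer node:
# candidates = np.argsort(-scores, axis=1)[:, :N]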
DeepLink and PALE converged at similar iteration counts, probably because these methods are based on similar concepts: both embed each layer of the multiplex network into a unique latent space and then use an MLP to learn the mapping function and complete the matching. The convergent iteration counts for IONE and its two variants, ONE and IONE-D, are likewise similar to one another, for the same reason. Moreover, we can see that DeepLink and PALE converge to their best performance sooner than IONE and its variants, probably because IONE and its variants must learn the context information of the nodes in each layer and therefore require more learning rounds to converge. P@30 for IONE and PALE decreases at higher iteration counts because these methods begin to overfit.

Figure 7: Comparison between baselines and MulCEV for different training iteration counts. (a) P@30 for different training iteration counts on the FT dataset; (b) P@30 for different training iteration counts on the DBLP dataset.

To evaluate the weighted-embedding method proposed in Section IV, we compared it with two commonly used network embedding methods: DeepWalk [63] and node2vec [64]. For DeepWalk, we set the number of walks per node to 20, the walk length to 80, and the window size to 5; for node2vec, we empirically set q < 1 and p = 2.

We compared the proposed weighted-embedding method and the two comparison methods under various @N settings and various training ratios on the FT dataset; Figure 8 displays the results. As can be seen in the figure, MulCEV achieved the highest precision at almost all @N settings and training ratios. At the different @N settings, P@N increased by a maximum of more than 6% and an average of 2.9% over DeepWalk, the better of the two comparison embedding methods. At the different training ratios, P@30 increased by a maximum of 3.8% and an average of 2.7% over node2vec, the better of the two comparison embedding methods.
These observations demonstrate the effectiveness and merits of the proposed weighted-embedding method.
Figure 8: Comparison of different embedding methods. (a) Precision under different @N settings on the FT dataset; (b) P@30 for different training ratios on the FT dataset.
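For reference, the comparison settings above (20 walks per node, walk length 80, window size 5) fit the standard walk-plus-Skip-gram pipeline. The sketch below additionally biases each step by edge weight in the spirit of the weighted-embedding idea, so that strongly connected nodes co-occur more often and receive more similar vectors; it is an illustrative approximation under these assumptions, not the paper's implementation.

import random

def weighted_walks(adj, num_walks=20, walk_length=80, seed=7):
    """Generate truncated random walks in which each step is drawn with
    probability proportional to edge weight, so stronger relationships
    are sampled more often. adj maps each node to a {neighbor: weight} dict."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_length - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                nodes, weights = zip(*nbrs.items())
                walk.append(rng.choices(nodes, weights=weights)[0])
            walks.append([str(n) for n in walk])
    return walks

# Skip-gram over the walks, with window size 5 as in the comparison setup
# (gensim >= 4 uses vector_size; earlier versions use size):
# from gensim.models import Word2Vec
# model = Word2Vec(weighted_walks(adj), vector_size=128, window=5, min_count=0, sg=1)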
6. Conclusion
We have proposed a framework called MulCEV to predict the interlayer links in a multiplex network. The framework makes full use of the information in the latent representation spaces through vector consistency and distance consistency; distance consistency leverages the CMNs of the unmatched nodes across different layers as references to provide additional clues for interlayer link prediction. In addition, we modeled the layers as weighted graphs to obtain better representations for network embedding, so that the higher the strength of the relationship between nodes, the more similar their embedding vectors in the latent representation space. To reduce the time complexity, we adopted matrix multiplication to optimize the calculation of the degree of match. Experiments on two real-world multiplex network datasets demonstrated that the proposed MulCEV framework markedly outperforms several state-of-the-art methods.

In summary, the proposed framework further improves the accuracy of network-embedding-based interlayer link prediction, especially when the number of training iterations is low. The framework can effectively associate the accounts belonging to the same user across different SMNs solely by leveraging network structure, in the absence of attribute information such as usernames, ages, or published content. Such associations can be used to establish patterns of law violations by cybercriminals, improve the understanding of information diffusion across SMNs, and support criminal investigations and evidence collection through SMNs. In the future, we plan to explore more suitable embedding methods for capturing the network structure and to make predictions in scenarios in which the number of nodes increases dynamically.
7. Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant Nos. U19A2081, 81602935, 81773548, 61802270, and 61802271; the Fundamental Research Funds for the Central Universities under Grant No. SCU2020D038; and the Sichuan Science and Technology Program under Grant No. 20YYJC4001.