GANE: A Generative Adversarial Network Embedding
Huiting Hong, Xin Li and Mingzhong Wang
ABSTRACT
Network embedding has recently become a hot research topic, as it can provide low-dimensional feature representations for many machine learning applications. Current work focuses on either (1) designing the embedding as an unsupervised learning task that explicitly preserves the structural connectivity of the network, or (2) obtaining the embedding as a by-product of the supervised learning of a specific discriminative task in a deep neural network. In this paper, we focus on bridging the gap between these two lines of research. We propose to adapt the generative adversarial model to perform network embedding, in which the generator tries to generate vertex pairs, while the discriminator tries to distinguish the generated vertex pairs from real connections (edges) in the network. The Wasserstein-1 distance is adopted to train the generator for better stability. We develop three variations of models, including GANE, which applies cosine similarity; GANE-O1, which preserves the first-order proximity; and GANE-O2, which preserves the second-order proximity of the network in the low-dimensional embedded vector space. We further prove that GANE-O2 has the same objective function as GANE-O1 when negative sampling is applied to simplify the training process in GANE-O2. Experiments with real-world network datasets demonstrate that our models consistently outperform state-of-the-art solutions, with significant improvements in precision for link prediction, as well as in visualization and accuracy for clustering tasks.
KEYWORDS
Network Embedding, Generative Adversarial Model, Wasserstein Distance, Link Prediction
1 INTRODUCTION
Representation learning, which provides low-dimensional vector-space representations of data, is an important research field in machine learning, since it can significantly simplify downstream algorithms. Representation learning for networks, a.k.a. network embedding, is particularly important for applications with massive amounts of network-style data, such as social networks and email graphs. The purpose of network embedding is to generate informative numerical representations of nodes and edges, which in turn enable further inference on the network data, such as link prediction and visualization.

Most existing methods in network embedding explicitly define the representative structures of the network as numerical/computational measurements during the representation learning process. For example, LINE [22] and SDNE [26] use local structure (first-order proximity) and/or global structure (second-order proximity), while DeepWalk [19] uses network community structure. The learned representations are then fed into machine learning toolkits to guide a specific discriminative task, such as link prediction with classification or regression models. That is, the network embedding is learned separately from the actual tasks, so the learned representations may not be capable of optimizing the objective functions of the tasks directly.

Alternative solutions utilize deep models to retrieve low-dimensional network representations. For example, Li et al. [13] use a variational autoencoder to represent an information network: the representations obtained right before the decoder layer are taken as the learned representations once the reconstruction loss converges. HNE [6] studies network embedding for heterogeneous data and integrates deep models into a unified framework to solve the similarity prediction problem. Similarly, the output of the layer before the prediction layer in HNE is treated as the embedding. In these approaches, the network embedding is a by-product of applying deep models to a specific task.

However, the network embeddings in existing solutions rely on somewhat handcrafted structures, and there is no systematic support for exhaustively exploring potential structures of networks. Therefore, this paper proposes to incorporate generative adversarial networks (GANs) into network embedding.

GANs [9] are promising frameworks for various learning tasks, especially in the computer vision area. Technically, a GAN consists of two components: a discriminator trying to predict whether a given data instance is generated by the generator or not, and a generator trying to fool the discriminator. Both components are trained by playing a minimax game from game theory. Various extensions have been proposed to address theoretical limitations of GANs. For example, [7][32] introduced modifications to GAN objectives, WGAN [3] changed the distribution distance measure to improve training stability, and IRGAN [27] extended the application domain to information retrieval tasks.

Motivated by the empirical success of the adversarial training process, we propose Generative Adversarial Network Embedding (GANE) to perform the network embedding task. To simplify the discussion, GANE only considers single-relational networks, in which all edges are of the same type, in contrast to multi-relational networks with various types of edges. Hereafter, network embedding stands for network embedding for single-relational networks.

In GANE, the generator tries to generate potential edges for a vertex and construct the representations for these edges, while the discriminator tries to distinguish the generated edges from real ones in the network and constructs its own representations. Besides using cosine similarity, we also adopt the first-order proximity to define the loss function for the discriminator and to measure the structural information of the network preserved in the low-dimensional embedded vector space. Under the principles of the minimax game, the generator tries to simulate the structures of the network with hints from the discriminator, and the discriminator in turn exploits the underlying structure to recover missing links in the network. The Wasserstein-1 distance is adopted to train the generator with improved stability, as suggested in WGAN [3]. To the best of our knowledge, this is the first attempt to learn network embeddings in a generative adversarial manner. Experiments on link prediction and clustering tasks were conducted to evaluate the performance of GANE. The results show that network embeddings learned by GANE can significantly improve the performance of supervised discrimination tasks in comparison with existing solutions. The main contributions of this paper can be summarized as follows:

• We develop a generative adversarial framework for network embedding. It is capable of performing feature representation learning and link prediction simultaneously under the adversarial minimax game principles.
• We discuss three variations of models, including naive GANE, which applies cosine similarity; GANE-O1, which preserves the first-order proximity; and GANE-O2, which preserves the second-order proximity of the network in the low-dimensional embedded vector space.
• We evaluate the proposed models with detailed experiments on link prediction and clustering tasks. Results demonstrate significant and robust improvements in comparison with other state-of-the-art network embedding approaches.

The rest of the paper is organized as follows. Section 2 summarizes the related work. Section 3 illustrates the design and algorithms of GANE. Section 4 presents the experimental design and results. Section 5 concludes the paper.
2 RELATED WORK
This paper is rooted in two research fields: network embedding and generative adversarial networks.
2.1 Network Embedding
Extensive efforts have been devoted to network embedding in recent years. Graph Factorization [1] represents the network as an affinity matrix of the graph and utilizes a distributed matrix factorization algorithm to find low-dimensional representations of the graph. DeepWalk [19] utilizes the distribution of node degrees to model the topological structure of the network via random walks, and uses skip-gram to infer latent representations of the vertices. Tang et al. proposed LINE [22] to preserve both local (first-order) and global (second-order) structures during the embedding process by minimizing the KL-divergence between the distributions of structures in the original network and in the embedded space; LINE has been one of the most popular network embedding approaches in the past two years. Thereafter, Wang et al. proposed modularized nonnegative matrix factorization to incorporate community structures and preserve them during representation learning [28]. SDNE [26] applies a semi-supervised deep learning framework for network embedding, in which the first-order proximity is preserved by penalizing similar vertices that lie far apart in the embedded space, and the second-order proximity is preserved by a deep autoencoder. Li et al. [14] incorporated the text information and structure of the network into the embedding representations by employing the variational autoencoder (VAE) [12]. Chang et al. proposed HNE [6] to address network embedding with heterogeneous information, in which deep models for content feature learning and structural feature learning are integrated in a unified framework.

In summary, existing solutions in network embedding use either shallow or deep models to preserve different structural properties in the low-dimensional space, such as the connectivity between vertices (first-order proximity), neighborhood connectivity patterns (second-order proximity), and other high-order proximities (e.g., community structure). However, these solutions employ handcrafted structures for the network embedding, and it is hard to exhaustively explore potential structures of networks due to the lack of systematic support. Therefore, we propose to leverage the generative adversarial framework to explore potential structures in networks and achieve more informative representations.
2.2 Generative Adversarial Networks
Recent advances in generative adversarial networks (GANs) [9] have proven GANs to be a powerful framework for learning complex data distributions. The core idea is to define the generator and the discriminator as minimax game players competing with each other, pushing the generator to produce high-quality data that fools the discriminator.

Mirza & Osindero introduced conditional GANs [18] to control the data generation by setting conditional constraints on the model. InfoGAN [7], another information-theoretic extension of the GAN model, maximizes the mutual information between a small subset of the latent variables and the observations to learn interpretable and meaningful hidden representations on image datasets. SeqGAN [32] models the data generator as a stochastic policy in reinforcement learning and uses the policy gradient to guide the learning process, bypassing the generator differentiation problem for discrete data output.

Despite their successes, GANs are notoriously difficult to train and prone to mode collapse [2], especially for discrete data. Energy-based GAN (EBGAN) [33] tries to achieve a stable training process by viewing the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. However, EBGANs, which regularize the distribution distance with the Jensen-Shannon (JS) divergence, share the same problem as classical GANs: the discriminator cannot be trained well enough, as the adopted distance cannot offer useful gradients everywhere. Replacing the JS divergence with the Earth Mover (EM) distance, Wasserstein-GAN [3] theoretically and experimentally alleviates this model fragility.

GANs have been successfully applied in the field of computer vision for tasks such as generating sample images, but there have been few attempts to apply GANs to other machine learning tasks. Recently, IRGAN [27] was proposed as an information retrieval model in which the generator focuses on predicting relevant documents given a query and the discriminator focuses on distinguishing whether the generated documents are relevant; it showed superior performance over state-of-the-art information retrieval approaches.

In this paper, we propose to explore the strength of generative adversarial models for network embedding. The proposed framework, GANE, performs feature representation learning and link prediction simultaneously under the adversarial minimax game principles. The Wasserstein-1 distance is adopted to define the overall objective function [3] and to overcome the notoriously unstable training of conventional GANs.
Figure 1: Architecture and dataflow in GANE. The generator G produces "unobserved" edges e' = (v_i, v_j) for a given vertex v_i; the discriminator D assigns a trust score D(e) to each edge, distinguishing observed edges e = (v_i, v_j) from generated ones.
3 GANE: GENERATIVE ADVERSARIAL NETWORK EMBEDDING
A network $N$ can be modeled as a set of vertices $V$ and a set of edges $E$; that is, $N = (V, E)$. The primal task of network embedding is to learn a low-dimensional vector-space representation for each vertex $v_i \in V$. Unlike existing approaches, which need to train the embedding representation before applying it to predictive tasks, we unify prediction and embedding learning in a single framework by leveraging the generative adversarial model.

Two components, the generator $G$ and the discriminator $D$, are defined in GANE to play the minimax game. The task of $G$ is to predict the well-matched edge $(v_i, v_j)$ given $v_i$, while the task of $D$ is to distinguish the observed edges from the "unobserved" edges generated by $G$. The overall architecture and dataflow of GANE are depicted in Fig. 1.

To avoid the unstable training of conventional GAN models, which are prone to mode collapse, the Earth-Mover (also called Wasserstein) distance $W(\mathbb{P}_{E'}, \mathbb{P}_E)$ is utilized to define the overall objective function, as suggested in WGAN [3]. $W(\mathbb{P}_{E'}, \mathbb{P}_E)$ can be informally defined as the minimum cost of transporting mass from the distribution $\mathbb{P}_{E'}$ to the distribution $\mathbb{P}_E$. Under mild assumptions, $W(\mathbb{P}_{E'}, \mathbb{P}_E)$ is continuous and differentiable almost everywhere. Following WGAN, the objective function is defined after the Kantorovich-Rubinstein duality [25]:

$$\min_{G_\theta} \max_{D_\phi} \left( \mathbb{E}_{e \sim \mathbb{P}_E}[D(e)] - \mathbb{E}_{e' \sim \mathbb{P}_{E'}}[D(e')] \right) \quad (1)$$

where $\mathbb{P}_E$ is the distribution of observed edges, and $\mathbb{P}_{E'}$ is the distribution of unobserved edges generated by $G$; that is, $e' = (v_i, v_j) \in G(v_i)$ with $G(v_i) \sim p_\theta(v_j | v_i)$. $D(e)$ is the utility function which computes the trust score (a scalar) for a given edge $e = (v_i, v_j)$.

3.1 Naive GANE
A naive version of GANE, referred to simply as GANE, is defined with the following scoring function, which does not consider the structural information of the network:

$$D(e) = s(v_i, v_j) = \cos(\vec{u}_i, \vec{u}_j) = \frac{\vec{u}_i^T \vec{u}_j}{|\vec{u}_i| \cdot |\vec{u}_j|} \quad (2)$$

where $\vec{u}_i, \vec{u}_j \in \mathbb{R}^d$ are the embedding representations of vertices $v_i$ and $v_j$, respectively.

The discriminator is trained to assign a high trust score to an observed edge $e$ but a lower score to an unobserved edge $e'$ generated by $G$, while the generator is trained to produce contrastive edges with maximal trust scores. Theoretically, there is a Nash equilibrium in which the generator perfectly fits the distribution of observed edges in the network (i.e., $\mathbb{P}_{E'} = \mathbb{P}_E$), and the discriminator cannot distinguish true observed edges from the generated ones. However, it is computationally infeasible to reach such an equilibrium, because the distribution of embeddings in the low-dimensional space keeps changing dynamically along with the model training process. Consequently, the generator tends to learn the distribution $\mathbb{P}_{E'}$ to model the network structure as accurately as possible, while the discriminator tends to accept potentially true (unobserved but in all probability true) edges.
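To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of the naive GANE critic. It is not the authors' implementation; the embedding table `U`, the batch shapes, and the initialization scale are our own assumptions.

```python
import torch

# Shared setup for the sketches in this paper: a trainable embedding table U,
# with edge batches given as (batch, 2) tensors of vertex indices.
num_vertices, dim = 10541, 128
U = torch.nn.Parameter(0.01 * torch.randn(num_vertices, dim))

def trust_score(edges):
    """Naive GANE critic, Eq. (2): D(e) = cos(u_i, u_j) for each edge."""
    u_i, u_j = U[edges[:, 0]], U[edges[:, 1]]
    return torch.nn.functional.cosine_similarity(u_i, u_j, dim=1)

def critic_loss(observed, generated):
    """Discriminator side of Eq. (1): minimize E[D(e')] - E[D(e)]."""
    return trust_score(generated).mean() - trust_score(observed).mean()
```

Maximizing $\mathbb{E}[D(e)] - \mathbb{E}[D(e')]$ in Eq. (1) is equivalent to minimizing the negated quantity computed by `critic_loss`, which is the form used in the critic update of WGAN.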
3.2 GANE with Structural Proximity
Structural information of the network may provide valuable guidance in the model learning process. Therefore, we extend the discriminator definition in GANE with the concepts of first-order and second-order proximity, which were introduced in LINE [22].

Definition 3.1. (First-order Proximity) The first-order proximity in a network describes the local pairwise proximity between two vertices. The strength of the tie between two vertices $v_i$ and $v_j$ is denoted as $w_{ij}$; $w_{ij} = 0$ indicates that there is no edge observed between $v_i$ and $v_j$.

The intuitive solution is to embed vertices with strong ties (i.e., high $w_{ij}$) close to each other in the low-dimensional space. Therefore, $w_{ij}$ can be used as a weighting factor to evaluate the embedding representation.

For network embedding, the goal is to minimize the difference between the probability distribution of the edges in the original space and that in the embedded space. The distance between the empirical probability distribution $\hat{p}(\cdot,\cdot)$ and the resulting probability distribution $p(\cdot,\cdot)$ of the network embedding can be defined as

$$O_1 = \mathrm{distance}(\hat{p}(\cdot,\cdot), p(\cdot,\cdot)) = -\sum_{(v_i, v_j) \in E} w_{ij} \log p(v_i, v_j) \quad (3)$$

where $p(v_i, v_j)$ is the joint probability between vertices $v_i$ and $v_j$, and $E$ is the set of observed edges in the network. The empirical probability is defined as $\hat{p}(i,j) = w_{ij}/W$, where $W = \sum_{(v_i, v_j) \in E} w_{ij}$. For each edge $(v_i, v_j)$, $p(v_i, v_j)$ is defined as

$$p(v_i, v_j) = \sigma(\vec{u}_i^T \vec{u}_j). \quad (4)$$

ALGORITHM 1: Network Embedding Learning in GANE
Input: $\alpha$, the learning rate; $c$, the clipping parameter; $m$, the batch size; $M$, the number of generated edges given a source vertex; $T$, the number of discriminator training steps per iteration.
Input: $V$, the set of vertices in the network; $E$, the set of edges in the network.
Randomly initialize $\phi$, $\theta$
repeat
    for $t = 0, \ldots, T-1$ do
        Sample $\{e_k\}_{k=1}^{m} \sim \mathbb{P}_E$, a batch of edges from the network.
        Sample $\{e'_k\}_{k=1}^{m} \sim \mathbb{P}_{E'}$, a batch of edges from the generated pool.
        Update $\phi$ to minimize $\frac{1}{m}\sum_{k=1}^{m} D(e'_k) - \frac{1}{m}\sum_{k=1}^{m} D(e_k)$
        $\mathrm{clip}(\phi, -c, c)$
    end
    Sample $\{v_i\}_{i=1}^{m} \sim V$, a batch of vertices from the network.
    for each $v_i$ do
        Sample $\{e_j = (v_i, v_j)\}_{j=1}^{M} \sim \mathbb{P}_{E'}$
    end
    Update $\theta$ to minimize $-\frac{1}{m}\sum_{i=1}^{m} \left( \frac{1}{M}\sum_{j=1}^{M} D(e_j) \right)$
until convergence;

Following Eq. (1), it is equivalent for the discriminator to minimize the loss of GANE-O1, which is GANE with the first-order proximity:

$$\begin{aligned} L^{O_1}_{D_\phi} &= \mathbb{E}_{e' \sim \mathbb{P}_{E'}}[D(e')] - \mathbb{E}_{e \sim \mathbb{P}_E}[D(e)] \\ &= \mathbb{E}_{(v_i, v_{j'}) \sim \mathbb{P}_{E'}}[w_{ij'} \log p(v_i, v_{j'})] - \mathbb{E}_{(v_i, v_j) \sim \mathbb{P}_E}[w_{ij} \log p(v_i, v_j)] \\ &= \mathbb{E}_{(v_i, v_{j'}) \sim \mathbb{P}_{E'}}[w_{ij'} \log \sigma(\vec{u}_i^T \vec{u}_{j'})] - \mathbb{E}_{(v_i, v_j) \sim \mathbb{P}_E}[w_{ij} \log \sigma(\vec{u}_i^T \vec{u}_j)]. \end{aligned} \quad (5)$$

Definition 3.2. (Second-order Proximity) The second-order proximity between a pair of vertices $(v_i, v_j)$ describes the similarity between their neighborhood structures in the network. Let $\vec{W}_i = (w_{i1}, w_{i2}, \ldots, w_{i|V|})$ denote the first-order proximity of $v_i$ with all other vertices. Then, the second-order proximity between $v_i$ and $v_j$ is determined by the similarity between $\vec{W}_i$ and $\vec{W}_j$.

The intuitive solution is to embed vertices with high second-order proximity close to each other in the low-dimensional space. By analogy with a corpus in natural language processing (NLP), the neighbors of $v_i$ can be treated as its context (nearby words), and a vertex $v_j$ with a similar context is considered to be similar. As in the skip-gram model [17], the probability of "context" $v_j$ being generated by $v_i$ is defined with the softmax function:

$$p(v_j | v_i) = \frac{\exp(\vec{u}_j^T \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k^T \vec{u}_i)}. \quad (6)$$

The objective function, which is the distance between the empirical conditional distribution $\hat{p}(\cdot|v_i)$ and the resulting conditional distribution $p(\cdot|v_i)$ in the embedded space, is defined as

$$O_2 = \sum_{v_i \in V} \lambda_{v_i} \, \mathrm{distance}(\hat{p}(\cdot|v_i), p(\cdot|v_i)) = -\sum_{v_i \in V} \sum_{\{j | (v_i, v_j) \in E\}} \lambda_{v_i} \, \hat{p}(v_j | v_i) \log p(v_j | v_i) \quad (7)$$

where $\lambda_{v_i}$ denotes the prestige of $v_i$ in the network. For simplicity, the sum of the weights in $\vec{W}_i$ is used as the prestige of $v_i$; that is, $\lambda_{v_i} = \sum_{j=1}^{|V|} w_{ij}$. The empirical distribution $\hat{p}(\cdot|v_i)$ is defined as

$$\hat{p}(v_j | v_i) = \frac{w_{ij}}{\sum_{j=1}^{|V|} w_{ij}}. \quad (8)$$

Then, Eq. (7) can be rewritten as

$$O_2 = -\sum_{(v_i, v_j) \in E} w_{ij} \log p(v_j | v_i) = -\sum_{(v_i, v_j) \in E} w_{ij} \log \left( \frac{\exp(\vec{u}_j^T \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k^T \vec{u}_i)} \right). \quad (9)$$

However, the computation of the objective function in Eq. (9) remains expensive, because the softmax term $p(v_j | v_i)$ sums over all vertices of the network. A general solution is to apply negative sampling [8] to bypass the summation. This solution is based on Noise Contrastive Estimation (NCE) [10], which shows that a good model should be able to differentiate data from noise by means of logistic regression.
With the method of negative sampling,

$$\log p(v_j | v_i) = \log \sigma(\vec{u}_j^T \vec{u}_i) + \sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_k(v)}[\log \sigma(-\vec{u}_k^T \vec{u}_i)]. \quad (10)$$

By replacing $D(e)$ in Eq. (1) with the negative-sampling form of Eq. (10), it is equivalent for the discriminator to minimize the loss of GANE-O2, which is GANE with the second-order proximity:

$$\begin{aligned} L^{O_2}_{D_\phi} &= \mathbb{E}_{e' \sim \mathbb{P}_{E'}}[D(e')] - \mathbb{E}_{e \sim \mathbb{P}_E}[D(e)] \\ &= \mathbb{E}_{(v_i, v_{j'}) \sim \mathbb{P}_{E'}}\left[ w_{ij'} \left( \log \sigma(\vec{u}_{j'}^T \vec{u}_i) + \sum_{k'=1}^{K} \mathbb{E}_{v_{k'} \sim P_k(v)}[\log \sigma(-\vec{u}_{k'}^T \vec{u}_i)] \right) \right] \\ &\quad - \mathbb{E}_{(v_i, v_j) \sim \mathbb{P}_E}\left[ w_{ij} \left( \log \sigma(\vec{u}_j^T \vec{u}_i) + \sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_k(v)}[\log \sigma(-\vec{u}_k^T \vec{u}_i)] \right) \right]. \end{aligned} \quad (11)$$

The noise distribution is empirically set to $P_k(v) \propto W_v^{3/4}$, where $W_v = \sum_{j=1}^{|V|} w_{vj}$. In general, the larger the number of negative samples $K$, the better the performance of the model. Moreover, $\sum_{k'=1}^{K} \mathbb{E}_{v_{k'} \sim P_k(v)}[\log \sigma(-\vec{u}_{k'}^T \vec{u}_i)]$ and $\sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_k(v)}[\log \sigma(-\vec{u}_k^T \vec{u}_i)]$ become equivalent as $K$ grows large, so the two noise terms cancel. Therefore, Eq. (11) reduces to

$$L^{O_2}_{D_\phi} = \mathbb{E}_{(v_i, v_{j'}) \sim \mathbb{P}_{E'}}[w_{ij'} \log \sigma(\vec{u}_{j'}^T \vec{u}_i)] - \mathbb{E}_{(v_i, v_j) \sim \mathbb{P}_E}[w_{ij} \log \sigma(\vec{u}_j^T \vec{u}_i)] = L^{O_1}_{D_\phi} \quad (12)$$

which shows that GANE-O1 (Eq. (5)) and GANE-O2 have the same objective function. For this reason, the rest of the paper only experiments with and discusses GANE and GANE-O1.
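For concreteness, here is a small NumPy sketch of the negative-sampling estimate in Eq. (10). The function and variable names are ours, and the noise distribution follows the $W_v^{3/4}$ choice stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_prob(U, i, j, noise_probs, K=5):
    """Eq. (10): one positive term for (v_i, v_j) plus K noise terms with
    vertices drawn from P_k(v) proportional to W_v ** 0.75."""
    pos = np.log(sigmoid(U[j] @ U[i]))
    noise = rng.choice(len(U), size=K, p=noise_probs)
    neg = np.log(sigmoid(-(U[noise] @ U[i]))).sum()
    return pos + neg

# noise_probs would be precomputed once from the vertex strengths W_v:
#   noise_probs = W ** 0.75
#   noise_probs /= noise_probs.sum()
```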
Table 1: The Optimization Direction and Ranking Score for Comparison
Models | Distribution Measurement | Optimization Direction | Objective Function | Ranking Score
LINE-O1 | KL-divergence | $\min \sum_{(v_i,v_j)\in E} -w_{ij} \log p(v_i, v_j)$ | $p(v_i, v_j) = \sigma(u_i^T u_j)$ | $u_i^T u_j$
LINE-O2 | KL-divergence | $\min \sum_{(v_i,v_j)\in E} -w_{ij} \log p(v_j | v_i)$ | $p(v_j | v_i) = \exp(u_i^T u_j) / \sum_{k=1}^{|V|} \exp(u_i^T u_k)$ | $u_i^T u_j$
LINE-(O1+O2) | KL-divergence | both LINE-O1 and LINE-O2 objectives above | both definitions above | $u_i^T u_j$
IRGAN | JS-divergence | $\min_\theta \max_\phi \sum \left( \mathbb{E}_{e=(v_i,v_j) \sim \mathbb{P}_E}[\log D(e)] + \mathbb{E}_{e'=(v_i,v_{j'}) \sim \mathbb{P}_{E'}}[\log(1 - D(e'))] \right)$ | $D(e) = \sigma(u_i^T u_j)$ | $u_i^T u_j$
GANE | Wasserstein distance | $\min_\theta \max_\phi \left( \mathbb{E}_{e=(v_i,v_j) \sim \mathbb{P}_E}[D(e)] - \mathbb{E}_{e'=(v_i,v_{j'}) \sim \mathbb{P}_{E'}}[D(e')] \right)$ | $D(e) = \mathrm{cosine}(u_i, u_j)$ | $\mathrm{cosine}(u_i, u_j)$
GANE-O1 | Wasserstein distance | $\min_\theta \max_\phi \left( \mathbb{E}_{e=(v_i,v_j) \sim \mathbb{P}_E}[w_{ij} \log p(v_i, v_j)] - \mathbb{E}_{e'=(v_i,v_{j'}) \sim \mathbb{P}_{E'}}[w_{ij'} \log p(v_i, v_{j'})] \right)$ | $p(v_i, v_j) = \sigma(u_i^T u_j)$ | $u_i^T u_j$

In the minimax game, the generator plays as an adversary of the discriminator, and it needs to minimize the loss function (referring to Eq. (1)):

$$L_{G_\theta} = -\mathbb{E}_{e' \sim \mathbb{P}_{E'}}[D(e')]. \quad (13)$$

The generator of GANE is in charge of generating unobserved edges. Unlike the sampling of random variables in the generating process of conventional GANs [9, 18], GANE requires $v_j$ to be a real vertex in the network when it generates/predicts an unobserved edge $(v_i, v_j)$ for a given $v_i$. As the sampling of vertex $v_j$ is discrete, Eq. (13) cannot be optimized directly. Inspired by SeqGAN [32], the policy gradient frequently used in reinforcement learning [31] is applied. The policy gradient for the GANE generator $G$ is derived as

$$\begin{aligned} \nabla_\theta L_{G_\theta} &= \nabla_\theta \left( -\mathbb{E}_{e' \sim \mathbb{P}_{E'}}[D(e')] \right) = -\sum_{n=1}^{N} \nabla_\theta P_\theta(e'_n) D_\phi(e'_n) \\ &= -\sum_{n=1}^{N} P_\theta(e'_n) \nabla_\theta \log P_\theta(e'_n) D_\phi(e'_n) = -\mathbb{E}_{e' \sim \mathbb{P}_{E'}}[\nabla_\theta \log P_\theta(e') D_\phi(e')] \\ &\simeq -\frac{1}{M} \sum_{j=1}^{M} \nabla_\theta \log P_\theta(e'_j) D_\phi(e'_j) \end{aligned} \quad (14)$$

where a sampling approximation is used in the last step: $e'_j = (v_i, v_j)$ is a sampled edge starting from a given source vertex $v_i$, following $\mathbb{P}_{E'}$, the distribution induced by the current version of the generator. The distribution $\mathbb{P}_{E'}$ is determined by the parameters $\theta$ of the generator; every time $\theta$ is updated during training, a new version of $\mathbb{P}_{E'}$ results. $M$ is the number of samples. In reinforcement learning terminology [21], the discriminator acts as the environment for the generator, feeding a reward $D_\phi(e'_j)$ to the generator $G$ when $G$ takes an action, i.e., generates/predicts an edge $(v_i, v_j)$ for a given $v_i$.

We randomly sample 90% of the edges from the network as the training set, and enforce the requirement that these samples cover all vertices, so that the embeddings of all vertices can be learned by our models. In each training iteration, the discriminator is trained $T$ times, while the generator is trained only once; a sketch of one such iteration follows.
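The PyTorch sketch below ties Algorithm 1 and Eq. (14) together: $T$ critic updates with weight clipping, followed by one REINFORCE-style generator update. It is a sketch under assumed helpers, not the authors' released code: `score(edges)` returns $D_\phi(e)$ per edge (the `trust_score` sketch above would serve), `gen_logits(src)` returns the generator's unnormalized scores over all vertices, and `sample_real(m)` returns a batch of observed edges.

```python
import torch

def train_step(critic_params, score, gen_logits, sample_real,
               opt_D, opt_G, m=64, M=16, T=5, c=0.01):
    """One GANE iteration (Algorithm 1): T critic steps, then one policy step."""
    for _ in range(T):
        real = sample_real(m)                               # e ~ P_E, shape (m, 2)
        src = real[:, 0]
        with torch.no_grad():                               # e' ~ P_E' from G
            fake_j = torch.multinomial(torch.softmax(gen_logits(src), 1), 1)
        fake = torch.cat([src.unsqueeze(1), fake_j], dim=1)
        loss_D = score(fake).mean() - score(real).mean()    # critic side of Eq. (1)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        with torch.no_grad():
            for p in critic_params:                         # clip(phi, -c, c)
                p.clamp_(-c, c)

    src = sample_real(m)[:, 0]                              # generator update
    logits = gen_logits(src)
    fake_j = torch.multinomial(torch.softmax(logits, 1), M, replacement=True)
    log_p = torch.log_softmax(logits, 1).gather(1, fake_j)  # log P_theta(e')
    edges = torch.stack([src.repeat_interleave(M), fake_j.flatten()], dim=1)
    with torch.no_grad():
        reward = score(edges).view(m, M)                    # D_phi(e') as reward
    loss_G = -(log_p * reward).mean()                       # REINFORCE form of Eq. (14)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

Clipping every critic parameter into $[-c, c]$ enforces the Lipschitz constraint required by the Kantorovich-Rubinstein duality in Eq. (1).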
Mini-batch stochastic gradient descent and the RMSProp [24] optimizer, based on the momentum method, are applied, as they perform well even on highly non-stationary problems. To keep the parameters $\phi$ in a compact space, we follow WGAN and simply clip the parameters after each gradient update; for more details, please refer to [3]. The overall algorithm for GANEs is provided in Algorithm 1.

4 EXPERIMENTS
To evaluate the proposed models, we applied the embedding representations to two task categories: link prediction and clustering. For each category, we compared the proposed GANE and GANE-O1 with several state-of-the-art approaches for network embedding. The models in comparison include:

• LINE [22].
LINE is a very popular state-of-the-art model for network embedding. Three variations of LINE were evaluated: LINE-O1, LINE-O2, and LINE-(O1+O2). LINE-O1 and LINE-O2 consider only the first-order proximity and the second-order proximity, respectively. LINE-(O1+O2) uses the concatenation of the output vectors of LINE-O1 and LINE-O2.
Table 2: Binary Classification Performance Comparison For Link Prediction
Models | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90%
LINE-O1 (%) | 73.12 | 75.77 | 77.51 | 78.39 | 79.18 | 79.40 | 79.92 | 80.09 | 80.33
LINE-O2 (%) | 77.83 | 83.71 | 86.90 | 86.19 | 87.08 | 89.25 | 89.21 | 88.91 | 89.99
LINE-(O1+O2) (%) | 82.18 | 86.73 | 85.03 | 89.40 | 91.74 | 90.65 | 92.32 | 92.44 | 93.06
IRGAN (%) | 58.52 | 62.07 | 63.06 | 62.52 | 64.48 | 58.54 | 66.78 | 63.71 | 63.42
GANE (%) |
GANE-O1 (%) | 80.49 | 83.24 | 86.82 | 84.92 | 85.72 | 82.43 | 83.87 | 86.34 | 85.90
Figure 2: Dataset Partition. 90% of the edges (covering all vertices) form the embedding training set and the remaining 10% form the ranking test set; for binary classification, x% of the edges (x = 10, 20, ..., 90) form the training set and the remaining (100-x)% the test set.

• IRGAN [27].
IRGAN was selected as a representative of GAN-related models. IRGAN is designed as a minimax game to improve the performance of information retrieval tasks. To enable the comparison, we adapted IRGAN into a model for network embedding by treating its parameters as the low-dimensional representations of the network.

• GANE.
The naive GANE model was defined in Section 3.1. It evaluates the trust score of an edge with the cosine similarity between the two vertices in the low-dimensional vector space.

• GANE-O1.
The GANE model with the first-order proximity of the network was defined in Section 3.2.

An overview of the key technical definitions of these models is provided in Table 1.

The dataset used in the experiments is a co-author network constructed from the DBLP dataset [23] (available at http://arnetminer.org/citation). The co-author network records the number of papers co-published by pairs of authors; a co-author relation is modeled as an undirected edge between two vertices (authors). The network we constructed covers three research fields: data mining, machine learning, and computer vision. It includes 10,541 authors and 97,072 edges. Each vertex (author) is labeled according to the research areas of the papers published by that author. The dimensionality of the embedding vectors is set to 128 for all models.
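The paper does not specify how the DBLP dump was preprocessed, so the following loader is purely illustrative; the tab-separated `author_i author_j co_paper_count` format is our assumption.

```python
import networkx as nx

def load_coauthor_network(edge_file):
    """Build the weighted, undirected co-author graph; edge weight w_ij is
    the number of papers co-published by the two authors."""
    G = nx.Graph()
    with open(edge_file) as f:
        for line in f:
            u, v, w = line.rstrip("\n").split("\t")
            G.add_edge(u, v, weight=int(w))
    return G
```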
4.1 Link Prediction
Link prediction tries to predict the missing neighbor $v_j$ of an unobserved edge $(v_i, v_j)$ for a given vertex $v_i$ of the network, or to predict the likelihood of an association between two vertices. It is worth noting that the proposed models GANE and GANE-O1 both answer link-prediction queries directly, as the generator is trained to produce the best $v_j$ given $v_i$; there is no need to train a binary classifier for link prediction, or to sort candidates by a specific metric, as most models usually do. To make fair and impartial comparisons, we nonetheless evaluated the link prediction task in two settings:
(1) as a binary classification problem employing the embedding representations learned by the models, and
(2) as a ranking problem scoring pairs of vertices represented as low-dimensional vectors.

4.1.1 Binary Classification. For the binary classification evaluation, we used a Multilayer Perceptron (MLP) [20] classifier to separate positive from negative samples. We randomly sampled different percentages of the edges as the training set and used the rest as the test set. In the training stage, the observed edges in the network were used as positive samples, and an equal number of unobserved edges were randomly sampled as negative samples. The embedding representations of the two vertices of an edge were concatenated as the input to the MLP classifier. In the test stage, the records in the test set were fed into the classifier.

Table 2 reports the accuracy of the binary classification achieved by the different models. The results show that our models (both GANE and GANE-O1) outperform all baselines consistently and significantly across the different training-test splits. Moreover, they are quite robust and insensitive to the size of the training set in comparison with the other approaches. Both GANE and GANE-O1 perform better than IRGAN, which demonstrates the effectiveness of adopting the Wasserstein-1 distance in GAN models. LINE-(O1+O2) performs best among the three variations of LINE, as it explores both the first-order and the second-order proximity, which are the representative structures of the co-author network, as suggested in [22]. Among the models that explicitly adopt the first-order proximity, GANE-O1 performs better than LINE-O1; our conjecture is that the proposed generative adversarial framework is capable of capturing and preserving more complex network structures implicitly.
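A scikit-learn sketch of the binary-classification protocol above follows; the paper does not report the MLP's architecture, so the hyperparameters here are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def edge_features(U, edges):
    """Concatenate the two endpoint embeddings of each edge (Sec. 4.1.1)."""
    return np.hstack([U[edges[:, 0]], U[edges[:, 1]]])

def train_link_classifier(U, pos_edges, neg_edges):
    """Fit the MLP on observed edges (label 1) vs. sampled non-edges (label 0)."""
    X = np.vstack([edge_features(U, pos_edges), edge_features(U, neg_edges)])
    y = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)  # assumed sizes
    return clf.fit(X, y)
```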
Table 3: Ranking Performance Comparison for Link Prediction
Metric | P@1 | P@3 | P@5 | P@10 | MAP | R@1 | R@3 | R@5 | R@10 | R@15 | R@20
LINE-O1 | 0.0185 | 0.0378 | 0.0378 | 0.0326 | 0.0812 | 0.0120 | 0.0754 | 0.1230 | 0.2100 | 0.2694 | 0.3203
Improve | 563.78% | 456.88% | 470.37% | 361.96% | 310.47% | 453.33% | 361.14% | 399.92% | 283.24% | 215.70% | 170.96%
LINE-O2 | 0.0 | 0.1124 | 0.1247 | 0.0921 | 0.2073 | 0.0 | 0.2483 | 0.4554 | 0.6409 | 0.6973 | 0.7278
Improve | N/A | 87.28% | 72.89% | 63.52% | 60.78% | N/A | 40.03% | 35.02% | 25.57% | 21.97% | 19.25%
LINE-(O1+O2) | 0.0 | 0.0928 | 0.0905 | 0.0650 | 0.1535 | 0.0 | 0.1971 | 0.3128 | 0.4323 | 0.4917 | 0.5282
Improve | N/A | 126.83% | 138.23% | 131.69% | 117.13% | N/A | 76.41% | 96.58% | 86.17% | 72.97% | 64.31%
IRGAN | 0.0231 | 0.1554 | 0.1665 | 0.1160 | 0.2681 | 0.0102 | 0.3311 | 0.5898 | 0.7750 | 0.8193 | 0.8406
Improve | 431.60% | 35.46% | 29.49% | 29.83% | 24.32% | 550.98% | 5.01% | 4.26% | 3.85% | 3.81% | 3.25%
GANE | 0.0 | 0.1864 |
Improve | N/A | 12.93% | -2.36% | -5.76% | 11.92% | N/A | 12.89% | -1.40% | -4.86% | -4.58% | -4.75%
GANE-O1 |
Figure 3: Training Curves (P@3 versus the number of training iterations for IRGAN, GANE, and GANE-O1).

It is worth noting that GANE shows its full strength on the link prediction task even though it is simpler than GANE-O1, which additionally considers the relationships between vertices in the network. This may be attributed to the fact that the co-author network is quite sparse.
4.1.2 Ranking. Multi-relational network embedding approaches, such as TransE [5] and TransH [29], usually use a metric such as $\|h + r - t\|$, where $h$, $r$, and $t$ denote the representation vectors of the head, relation, and tail respectively, to select a pool of candidates for ranking in link prediction. Single-relational network embedding, which is the focus of this paper, usually relies on a binary classifier to determine the results of link prediction, as shown in Sec. 4.1.1, because no metric is directly available as a ranking criterion. Consequently, a pool of candidates cannot be provided for tasks that need one, e.g., aligning users across social networks [15]. Alternatively, we used the probability of the existence of an edge, which can be computed implicitly by the network embedding model, to evaluate the ranking for candidate selection.
Figure 4: Performance Comparison on Clustering.

We used the records that never appeared in the network-embedding training process as the test set. Fig. 2 depicts the training and test sets used in each task (embedding, classification, ranking). A pair of vertices is scored according to the optimization direction of each model, as detailed in Table 1. Technically, we used the inner product $u_i^T u_j$ of the two vertex vectors as the scoring criterion, since $\sigma(u_i^T u_j)$ constitutes the main part of the probability of the existence of an edge and the sigmoid function is strictly monotonically increasing. We then ranked all pairs $(v_i, v_j)$, $j = 1, 2, \ldots, |V|$, for a given $v_i$ by this score. We used precision [30], recall [30], and Mean Average Precision (MAP) [4] to evaluate the prediction performance.

Table 3 shows the ranking performance of all models. Our models (GANE and GANE-O1) outperform the others again in terms of all evaluation metrics. Surprisingly, GANE-O1 provides quite impressive precision@1, whereas the other models present rather poor results there.
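A minimal sketch of the inner-product ranking criterion described above; precision@k here follows the standard definition, and all names are ours.

```python
import numpy as np

def rank_candidates(U, i):
    """Score all pairs (v_i, v_j) by u_i^T u_j and sort by descending score."""
    scores = U @ U[i]
    scores[i] = -np.inf                      # exclude the trivial self pair
    return np.argsort(-scores)

def precision_at_k(ranked, true_neighbors, k):
    """Fraction of the top-k ranked candidates that are true neighbors."""
    return len(set(ranked[:k]) & set(true_neighbors)) / k
```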
Even though both IRGAN and the GANEs are based on GAN models, the GANEs consistently perform better than IRGAN. Moreover, GANE and GANE-O1 converge rapidly in comparison with IRGAN: Fig. 3 plots P@3 as the number of iterations increases. This may be attributed to the application of the Wasserstein-1 distance.
4.2 Clustering
Figure 5: Visualization of the co-author network: (a) GANE, (b) GANE-O1, (c) IRGAN, (d) LINE-O1, (e) LINE-O2, (f) LINE-(O1+O2).

We first investigated the quality of the embedding representations in an intuitive way, by visualization. PCA [11] was adopted for dimension reduction: we selected three principal components to visualize the vertices of the network in 3-D space. The resulting visualizations for the different embedding models are illustrated in Fig. 5. Only the visualizations of GANE and GANE-O1 present a relatively clear pattern for the labeled vertices, with authors devoted to the same research area clustered together. GANE produces the most favorable layout in terms of a clear clustering pattern. The LINE variants do not perform well, as they require a rather dense network for model training.
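A sketch of this visualization step with scikit-learn and matplotlib; the marker size and coloring scheme are our choices.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def visualize_embeddings(U, labels):
    """Project the 128-d embeddings onto three principal components and
    color each vertex by its research-field label (Fig. 5)."""
    coords = PCA(n_components=3).fit_transform(U)
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, s=4)
    plt.show()
```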
We then applied k-means [16] to cluster all vertices in the low-dimensional vector space, with the number of clusters set to 3, and used majority vote to label the three clusters as "data mining", "machine learning", or "computer vision". We quantitatively computed the clustering accuracy of each cluster, defined as the proportion of "accurately" clustered vertices to the size of the cluster.

Fig. 4 illustrates the clustering accuracy achieved by the different models on each cluster. Again, GANE produces the best accuracy, which is consistent with the visualization. We argue that GANE effectively preserves the proximities among vertices in the low-dimensional space.

In summary, our GANEs (GANE and GANE-O1) achieve the best performance on both link prediction and clustering tasks, which demonstrates the strength of the generative adversarial framework. The first-order proximity intentionally adopted in GANE-O1 does not significantly boost the embedding performance, as seen from the comparison between GANE and GANE-O1. We think that deliberately preserving handcrafted structures may lead the embedding to overlook other latent complex structures in the network, as it is impossible to explore all structures with conventional methods. GANE, in contrast, may provide a way to explore the underlying structures as completely as possible by incorporating link prediction as a component of the generative adversarial framework.
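The per-cluster accuracy described above can be computed as in the following sketch, with each cluster's label assigned by majority vote; the function names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(U, labels, k=3):
    """k-means in the embedded space; each cluster takes its majority label,
    and accuracy is the fraction of members carrying that label.
    labels: integer field id per vertex (e.g., 0/1/2 for the three areas)."""
    assignments = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    accuracy = {}
    for c in range(k):
        members = labels[assignments == c]
        majority = np.bincount(members).argmax()   # majority-vote label
        accuracy[c] = float((members == majority).mean())
    return accuracy
```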
5 CONCLUSION
This paper proposes a generative adversarial framework for network embedding. To the best of our knowledge, it is the first attempt to learn network embeddings in a generative adversarial manner. We present three variations of the model, including GANE, which applies cosine similarity; GANE-O1, which preserves the first-order proximity; and GANE-O2, which preserves the second-order proximity of the network in the low-dimensional embedded vector space. The Wasserstein-1 distance is adopted to train the generator with improved stability. We also prove that GANE-O2 has the same objective function as GANE-O1 when negative sampling is applied to simplify the training process in GANE-O2.

Experiments on link prediction and clustering tasks demonstrate that our models consistently outperform state-of-the-art solutions for network embedding. Moreover, our models are capable of performing feature representation learning and link prediction simultaneously under the adversarial minimax game principles. The results also demonstrate the feasibility and strength of generative adversarial models for exploring the underlying complex structures of networks.

In the future, we plan to study the application of the generative adversarial framework to multi-relational network embedding. We also plan to gain more insight into the mechanisms by which GANs can direct the exploration and discovery of underlying complex structures in networks.
REFERENCES
[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. 2013. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web (WWW '13). 37-48.
[2] Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
[4] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. 1999. Modern Information Retrieval. Vol. 463. ACM Press, New York.
[5] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26 (NIPS 2013). Lake Tahoe, Nevada, United States.
[6] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. 2015. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 119-128. https://doi.org/10.1145/2783258.2783296
[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172-2180.
[8] Chris Dyer. 2014. Notes on noise contrastive estimation and negative sampling. CoRR abs/1410.8251 (2014). http://arxiv.org/abs/1410.8251
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2672-2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[10] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13, Feb (2012), 307-361.
[11] Harold Hotelling. 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 6 (1933), 417.
[12] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
[13] Hang Li, Haozheng Wang, Zhenglu Yang, and Haochen Liu. 2017. Effective representing of information network by variational autoencoder. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017). 2103-2109. https://doi.org/10.24963/ijcai.2017/292
[14] Hang Li, Haozheng Wang, Zhenglu Yang, and Haochen Liu. 2017. Effective representing of information network by variational autoencoder. In Twenty-Sixth International Joint Conference on Artificial Intelligence. 2103-2109.
[15] Li Liu, William K. Cheung, Xin Li, and Lejian Liao. 2016. Aligning users across social networks using network embedding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA.
[16] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129-137.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111-3119.
[18] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
[19] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 701-710.
[20] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. California Univ San Diego La Jolla Inst for Cognitive Science.
[21] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. 1057-1063.
[22] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067-1077.
[23] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 990-998.
[24] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2 (2012), 26-31.
[25] Cédric Villani. 2009. Optimal Transport: Old and New. Vol. 338. Springer.
[26] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 1225-1234. https://doi.org/10.1145/2939672.2939753
[27] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. arXiv preprint arXiv:1705.10513 (2017).
[28] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. 2017. Community preserving network embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, California, USA.
[29] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2014), Québec City, Québec, Canada.
[31] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229-256.
[32] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI. 2852-2858.
[33] Junbo Zhao, Michael Mathieu, and Yann LeCun. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016).