Tutorial on NLP-Inspired Network Embedding
Boaz Shmueli
Social Networks and Human-Centered Computing, Taiwan International Graduate Program (TIGP)
Institute of Information Science, Academia Sinica
National Tsing Hua University
January 2018
In this tutorial I cover a few recent papers in the field of network embedding. Network embedding is a collective term for techniques for mapping graph nodes to vectors of real numbers in a multidimensional space. To be useful, a good embedding should preserve the structure of the graph. The vectors can then be used as input to various network and graph analysis tasks, such as link prediction. The papers I discuss develop methods for the online learning of such embeddings. These developments in online learning of network embeddings have major applications for the analysis of graphs and networks, including online social networks.

Recently, researchers have adapted ideas and techniques from word embeddings to the domain of network embeddings. This tutorial covers these recent developments. Specifically, I cover the following papers:
Title                                                                      | Where, Who, When
DeepWalk: Online Learning of Social Representations                        | KDD, Perozzi et al. (2014)
LINE: Large-scale Information Network Embedding                            | WWW, Tang et al. (2015)
node2vec: Scalable Feature Learning for Networks                           | KDD, Grover and Leskovec (2016)
struc2vec: Learning Node Representations from Structural Identity          | KDD, Ribeiro et al. (2017)
metapath2vec: Scalable representation learning for heterogeneous networks  | KDD, Dong et al. (2017)

Table 1: Reviewed Papers

The papers use various methods to sample the nodes and create node contexts. Subsequently, machine learning techniques perform the embedding. The first paper, “
DeepWalk: Online Learning of Social Representations”, is a seminal paper that generalizes a well-known NLP word embedding technique, word2vec, to graphs. The second and third papers, “LINE: Large-scale Information Network Embedding” and “node2vec: Scalable Feature Learning for Networks”, improve upon DeepWalk in a substantial way. The papers were presented at SIGKDD and WWW conferences (2014, 2015, and 2016) and are highly cited: 471, 362, and 268 citations in Google Scholar; 374, 330, and 139 in Microsoft Academic (the latter service is reputed to be more reliable). The last two papers reviewed are “struc2vec: Learning Node Representations from Structural Identity” and “metapath2vec: Scalable representation learning for heterogeneous networks”. Fig. 1 shows a time outline of the papers reviewed in this tutorial. (The figures in this tutorial are drawn by the author, unless otherwise stated.)

[Figure 1: Paper outline]

The papers use word2vec, an algorithm designed for performing word embedding. Thus, I first give a brief introduction to this technique. The original word2vec papers are Mikolov et al. (2013a) and Mikolov et al. (2013b). Since many of the descriptions in these papers are “somewhat cryptic and hard to follow” (Goldberg and Levy, 2014), a small cottage industry of pre-prints explaining word2vec has sprung up, including Rong (2014) and Goldberg and Levy (2014).

word2vec

Traditional natural language processing systems treat words as discrete symbols. Words are then represented as one-hot vectors in a very high-dimensional space. With a vocabulary of V different words, the space is of dimension V, and each word is represented by a vector of length V with a single component of 1 and V − 1 components of 0.

In contrast, vector space models embed words in a low-dimensional real vector space of dimension d, such that words which are semantically similar are represented by closer vectors. This embedding is based on the principle of co-occurrence: the assumption that words with related semantic meanings appear close to each other in texts. This assumption is famously summarized by the British linguist J. R. Firth's statement that “you shall know a word by the company it keeps” (Firth, 1957).

word2vec (Mikolov et al. (2013a), Mikolov et al. (2013b)) is a machine learning model for the efficient learning of word embeddings given raw text. There are two versions of word2vec, Continuous Bag-of-Words (CBOW) and Skip-gram. Here I focus on Skip-gram, since this is the model used by DeepWalk and node2vec.

Given a word w, we look at its context-words c ∈ C(w) using a sliding window of size 2 × k: the k words preceding w, and the k words following w. A typical value for this window parameter is k = 5. When we consider the conditional probabilities p(c | w), the goal is to find the parameters θ of p(c | w; θ) such that the following is maximized:

$$\arg\max_\theta \prod_{c \in C(w)} p(c \mid w; \theta) \qquad (1)$$

For the entire corpus T we have:

$$\arg\max_\theta \prod_{w \in T} \prod_{c \in C(w)} p(c \mid w; \theta) \qquad (2)$$

which we can also write as

$$\arg\max_\theta \prod_{(w,c) \in D} p(c \mid w; \theta) \qquad (3)$$

where D is the set of all the (w, c) pairs of (word, context-word).
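To make the construction of D concrete, here is a minimal Python sketch (my own illustration, not code from the papers) that extracts the (word, context-word) pairs from a tokenized sentence with a window of k words on each side:

```python
from itertools import chain

def context_pairs(tokens, k=5):
    """Collect the (word, context-word) pairs D from a token list,
    using a window of k words before and k words after each word."""
    pairs = []
    for i, w in enumerate(tokens):
        left = range(max(0, i - k), i)
        right = range(i + 1, min(len(tokens), i + k + 1))
        for j in chain(left, right):
            pairs.append((w, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
print(context_pairs(sentence, k=2)[:4])
# [('you', 'shall'), ('you', 'know'), ('shall', 'you'), ('shall', 'know')]
```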
In a corpus T with a vocabulary of V words, we have approximately 2 × k × V such pairs.

The approach taken by word2vec is to parameterize the model using a classical soft-max neural network:

$$p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}} \qquad (4)$$

where v_w and v_c are the vector representations of the words and the context-words, both in R^d. The input layer thus consists of V neurons, where each word is represented by a one-hot vector. The output layer has V neurons, one for each context word, and acts as a soft-max classifier. The hidden layer has d neurons, where d is the size of the embedding space. This is shown in Fig. 2.

[Figure 2: word2vec neural network]

As is usually the case, the objective function is easier to optimize after taking its log. This step is omitted here for the sake of brevity. Following the optimization, the V × d weights of the d neurons of the hidden layer serve as the vector representations. The hidden layer basically serves as an auto-encoder that produces the word embeddings. The outputs from the final layer are not needed.

Due to the time complexity of the training (there are V × V × d parameters to be computed), it is impractical to scale the training of the network to very large corpora. The authors describe two tricks for better computational efficiency. The first is hierarchical softmax, an efficient way of computing softmax (Mnih and Salakhutdinov (2008) and Morin and Bengio (2005)). Hierarchical softmax reduces the complexity of computation per training instance per context word from O(V) to O(log V). The second is negative sampling, where only a sample of the output vectors is updated per iteration. These two improvements make it possible to train the model on very large amounts of text.

There is some confusion regarding terminology in this cutting-edge field, especially between graph embedding and node embedding. It would perhaps be more consistent to use node embedding for the mapping of nodes to vectors, in the same way that word embedding is used in NLP. Indeed, some authors use this term. Others, however, use graph embedding. This is confusing, since graph embedding is also used to denote the mapping of an entire graph into a single vector (in much the same way that document embedding maps an entire document into a vector and thus allows one to find distances between documents). Another factor adding to the confusion is that graph embedding is a well-established topic in the mathematical field of topological graph theory. In this tutorial I chose to use yet another term, network embedding, to describe the mapping of the nodes to vectors, as this currently seems to be the most popular term in the machine learning community.

DeepWalk

word2vec is a huge success, and inspires many derivative works in related fields. One of the most intriguing developments is
DeepWalk (Perozzi et al., 2014), a new online learning method for the embedding of graphs, with an emphasis on graphs representing social networks.

The problem of network embedding can be formalized as follows: given a graph G(V, E) with a set V of nodes and a set E of edges, compute a mapping $f : V \rightarrow \mathbb{R}^d$ such that the structure of G is preserved as much as possible. A vast literature in the field covers multiple techniques for attacking the problem of network embedding (Fouss et al., 2016). For example, spectral methods use the eigenvectors of various graph matrices to compute the embeddings, thus offering exact solutions following a closed-form formulation. Another approach is to use the analogy between nodes and edges and physical spring networks, resulting in force-directed methods.

DeepWalk, on the other hand, gets its inspiration from NLP and generalizes the word2vec model to graphs using an analogy between documents and graphs, as shown in Table 2: a document is seen as a graph of words. In the word2vec model, words are used to estimate the likelihood of the nearby context-words in the sentence. In much the same way,
DeepWalk uses nodes to estimate the likelihood of nearby nodes in the graph. To paraphrase J. R. Firth, “you shall know a node by the company it keeps”. The nearby nodes are sampled using random walks. The assumption is that sampling from multiple random walks captures the structure of the graph. The
DeepWalk algorithm generates random walks for each node v ∈ V. Each random walk starts from the origin node v and then advances to a node uniformly selected from its immediate neighbors. The length of the walks is T. Thus, each such random walk generates an ordered list of nodes:

$$RW(v) = (v = u_1, u_2, \ldots, u_T) \qquad (5)$$

Each walk is a “sentence” of nodes. As with the words in word2vec, for each node u_i within this “sentence”, the algorithm then looks at a window of k neighboring nodes before and after u_i:

$$C(u_i) = \{u_{i-k}, \ldots, u_{i+k}\} \qquad (6)$$

We would then like to find θ that maximizes:

$$\arg\max_\theta \prod_{u \in RW(v)} p(u' \in C(u) \mid u; \theta) \qquad (7)$$

For the entire graph G we have:

$$\arg\max_\theta \prod_{v \in G} \prod_{u \in RW(v)} p(u' \in C(u) \mid u; \theta) \qquad (8)$$

which we can also write as

$$\arg\max_\theta \prod_{(u',u) \in D} p(u' \mid u; \theta) \qquad (9)$$

where D is the set of all (node, context-node) pairs discovered in all the random walks of all the nodes. To improve sampling, γ random walks per node v are performed. γ, T, and k are all hyper-parameters of the algorithm. In total, γ × |V| random walks of length T are generated, and there are approximately γ × |V| × T × 2k (node, context-node) pairs.

As can be seen, DeepWalk's Equation (9) is equivalent to
word2vec Skip-gram's Equation (3), and indeed the same neural network is used to compute the embeddings for the nodes, with some mandatory adjustments (for example, the input layer consists of the one-hot representations of the nodes).
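Putting the pieces together, here is a minimal sketch of the DeepWalk pipeline (my own illustration using networkx and gensim's Word2Vec with its 4.x API; the hyper-parameter values are only placeholders):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, T=40):
    """One truncated random walk: repeatedly move to a uniformly
    chosen neighbor of the current node."""
    walk = [start]
    for _ in range(T - 1):
        walk.append(random.choice(list(G.neighbors(walk[-1]))))
    return [str(u) for u in walk]   # gensim expects string tokens

G = nx.karate_club_graph()
gamma, T = 80, 40                   # walks per node, walk length
walks = [random_walk(G, v, T) for _ in range(gamma) for v in G.nodes()]

# The walks are the "sentences"; Skip-gram (sg=1) with hierarchical
# softmax (hs=1) then learns the node embeddings.
model = Word2Vec(walks, vector_size=128, window=10, sg=1, hs=1, min_count=0)
vec = model.wv["0"]                 # embedding of node 0
```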
The original motivation for
DeepWalk is solving the following social-network multi-label classification problem: we are given a graph G(V, E) of a social network and node attributes X ∈ R^{|V| × S} (where S is the size of the feature space). There is also a set of labels Y, and some of the nodes are labeled with y ∈ Y. The task is to predict the labels of the other (unlabeled) nodes. (In fact, the paper states that the label output is Y ∈ R^{|V| × |Y|}. I found this confusing, if not wrong, since the labels are not real numbers but belong to a discrete set of labels.)

This problem is known as a relational classification problem. Traditional approaches use the graph structure for classification. In this paper, we first learn the embedding of the nodes X_E ∈ R^{|V| × d}, and these are then treated as additional features: X_E is combined with the attributes X as input to any standard classification algorithm.

The authors compared the performance of DeepWalk with five other baseline methods:

• Spectral Clustering of the normalized graph Laplacian (Tang and Liu, 2011)
• Modularity Matrix (Tang and Liu, 2009a)
• k-means clustering of the adjacency matrix (Tang and Liu, 2009b)
• wvRN, a weighted-vote relational neighbor classifier (Macskassy and Provost, 2003)
• Majority, a naive baseline that picks the most frequent labels

The classifier used for DeepWalk is a one-vs-rest logistic regression, and the parameters selected are d = 128 (number of dimensions), T = 50 (length of random walks), w = 10 (window size), and γ = 80 (number of random walks per node); a sketch of this evaluation protocol appears at the end of this section.

Three test datasets are used for the multi-label classification task: BlogCatalog, Flickr, and
YouTube. The results show that
DeepWalk has strong performance that is almost as good as, or exceeds, the leading method, Spectral Clustering. In addition, it needs relatively few labels to perform well. It also has the advantage that it is computationally feasible to perform this embedding on huge networks such as
YouTube, for which spectral clustering is computationally infeasible.
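As promised above, here is a sketch of the evaluation protocol: a one-vs-rest logistic regression over embedding features. This is my own illustration with random placeholder data standing in for real embeddings and labels, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# X_E: |V| x d node embeddings (e.g., the vectors learned by the
# DeepWalk sketch above); Y: |V| x L binary label-indicator matrix.
rng = np.random.default_rng(0)
X_E = rng.normal(size=(1000, 128))
Y = rng.integers(0, 2, size=(1000, 39))

X_tr, X_te, Y_tr, Y_te = train_test_split(X_E, Y, test_size=0.5, random_state=0)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_tr, Y_tr)
# The papers report Macro- and Micro-F1; plain subset accuracy is
# used here only for brevity.
print(clf.score(X_te, Y_te))
```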
LINE

While the analogy of graphs and texts is useful, networks do not possess the linear property of text. So while the neighborhood of a word can be quite accurately sampled using a sliding window, social and other large networks are not linear, and thus sampling their structure is more challenging. Following the introduction of
DeepWalk, many other researchers have started to use similar NLP-inspired methods for network embedding. One of the problems with
DeepWalk is that it uses unbiased random walks for generating the node contexts. In that way, it is similar to a depth-first search (DFS). The work by Tang et al. (2015) tries to solve this issue by preserving first-order and second-order node proximities.

The first-order proximity of two nodes is defined to be the weight of the edge between them (1 for unweighted graphs), and zero if they do not share an edge. For example, nodes 6 and 7 in Fig. 3 should be embedded closely, since they are connected by a “heavy” edge (compared to nodes 7 and 8, for example).

[Figure 3: First-order and second-order proximities]

The second-order proximity is defined by the common neighborhood of two nodes. Referring back to Fig. 3, nodes 5 and 6 have a high second-order proximity since they share many neighbors, even though they are not directly connected. Thus, they should also be embedded closely together in the embedding space.

In mathematical terms, the first-order proximity between any two nodes v_i and v_j is defined by LINE to be the following joint probability:

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{v}_i \cdot \vec{v}_j)}$$

where $\vec{v}$ is the vector representation of node v in R^d. Ideally, this probability would be equal to the empirical probability induced by the edge weights:

$$\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{W}$$

where w_{ij} is the weight of the edge between the two nodes and W is a normalization factor.

The embedding should try to minimize the distance between the two distributions p_1(·, ·) and p̂_1(·, ·). LINE uses the KL-divergence to measure the distance between distributions, and so the objective to optimize is:

$$O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$$

For the second-order proximity, the authors choose the neighbors of each node to provide “context” in the word2vec sense. The development for this is quite tedious, so we will skip it here; the reader is referred to Section 4.1.2 in Tang et al. (2015). The result is the following objective function for the second-order proximity:

$$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$$

where, similarly to the Skip-gram model, the conditional probability between two nodes is given by:

$$p_2(v_j \mid v_i) = \frac{\exp(\vec{v}_j' \cdot \vec{v}_i)}{\sum_{k=1}^{|V|} \exp(\vec{v}_k' \cdot \vec{v}_i)}$$

The second-order proximity thus defines a conditional probability p_2(v_j | v_i) over all other context nodes v_k.

In LINE, the embeddings for the first-order and second-order proximities (i.e., minimizing O_1 and O_2) are computed separately. The optimization uses asynchronous stochastic gradient descent (ASGD) with various performance tricks, including edge sampling. The vectors from the two models are then concatenated for each node. As the authors note, this approach is not very “principled”, and a better approach that would minimize O_1 and O_2 simultaneously is more appropriate.

LINE is compared against baseline methods on various tasks. The data includes a language network built from the entire English Wikipedia, social networks (Flickr and YouTube), and citation networks. The algorithms compared include graph embedding using matrix factorization, DeepWalk, and several variations of the LINE algorithm. The results clearly show that the LINE embeddings that include both first-order and second-order proximities outperform all other methods in classification tasks.

node2vec

Following
DeepWalk and LINE, Grover and Leskovec (2016) had the insight that a better capture of a node's neighborhood can be achieved by carefully biasing the random walks. This leads to latent representations that better capture the network structure. node2vec indeed achieves excellent performance in multiple social-graph tasks.

Assume that we want a sample of k nodes from the neighborhood of each node. Two extreme strategies are Breadth-First Sampling (BFS) and Depth-First Sampling (DFS). BFS samples the immediate k neighbors of the node, and thus helps in understanding its local structure. DFS samples k increasingly distant nodes, and thus identifies a more global community structure. A rich sample that combines the properties of BFS and DFS is the intuition behind the creation of the biased random walk.

Let us recall that the random walk proposed by DeepWalk had the following uniform probability of advancing from node u_i to node u_{i+1}:

$$P(u_{i+1} = v \mid u_i = w) = \begin{cases} \frac{1}{d(w)} & \text{if } (v, w) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

where d(w) is the degree of node w.

In contrast, the random walk proposed by node2vec has two parameters that control the walk, p and q. Assume that in the random walk we just advanced from node t to node v. There are three possibilities for the next node in the walk: we can (1) go from v back to t, (2) advance to a third node x that is a common neighbor of both t and v, or (3) advance to a third node x that is a neighbor of v but not of t. The parameters p and q control the probabilities for each of these types of transitions. More specifically, assuming identical edge weights, the unnormalized transition probability from node v to node x, having arrived from node t, is:

$$\pi_{vx} = \begin{cases} \frac{1}{p} & \text{if } d(t, x) = 0 \\ 1 & \text{if } d(t, x) = 1 \\ \frac{1}{q} & \text{if } d(t, x) = 2 \end{cases} \qquad (11)$$

where p is the return parameter, which controls the likelihood of the walk backtracking to the previous node, and q is the in-out parameter, which decides whether to favor nodes closer to t. See Fig. 4 for an example of the transition probabilities. By controlling these two parameters, the random walks can achieve a balance of the benefits of both BFS and DFS, and thus more accurately represent the local and global graph properties.

[Figure 4: Unnormalized transition probabilities from node v, after just transitioning from node t]

The rest of the exposition in Grover and Leskovec (2016) follows a similar path to DeepWalk, i.e., it uses the word2vec neural network model with a hidden layer that calculates the latent vector representations. Optimization is done using stochastic gradient descent (SGD). The algorithm also borrows word2vec's negative-sampling trick to achieve scaling to graphs with millions of nodes.
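Here is a minimal sketch of this biased walk for unweighted graphs (my own illustration; the reference implementation precomputes these transition probabilities with alias tables for speed):

```python
import random
import networkx as nx

def biased_step(G, t, v, p=1.0, q=2.0):
    """One node2vec step: choose the node x that follows the move
    t -> v, using the unnormalized weights 1/p (back to t),
    1 (common neighbor of t and v), and 1/q (neighbor of v only)."""
    neighbors = list(G.neighbors(v))
    weights = []
    for x in neighbors:
        if x == t:                    # d(t, x) == 0
            weights.append(1.0 / p)
        elif G.has_edge(t, x):        # d(t, x) == 1
            weights.append(1.0)
        else:                         # d(t, x) == 2
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(G, start, T=40, p=1.0, q=2.0):
    """Each step needs the two previous nodes, so the first move is uniform."""
    walk = [start, random.choice(list(G.neighbors(start)))]
    while len(walk) < T:
        walk.append(biased_step(G, walk[-2], walk[-1], p, q))
    return walk
```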
In addition to biased random walks, one of the interesting contributions of node2vec is the extension of feature learning to edges. For this goal, a binary operator ◦ is defined on the vector representations f(v) and f(u) of any two nodes v and u, with the aim of generating a vector representation of the pair (u, v) such that g(u, v) : V × V → R^{d'}. Several choices for this binary operator, where d = d', are shown in Table 3. The operator is defined whether or not the edge actually exists in the graph, paving the way for the use of this representation in a link prediction task.

Operator      Definition
Average       (f_i(u) + f_i(v)) / 2
Hadamard      f_i(u) · f_i(v)
Weighted-L1   |f_i(u) − f_i(v)|
Weighted-L2   |f_i(u) − f_i(v)|²

Table 3: Different operators showing the i-th component of g(u, v)

The authors test node2vec with two tasks. The first is a multi-label classification task, similar to the one used with
DeepWalk. In this task, the node feature representations are input to a one-vs-rest logistic regression classifier. Three datasets are tested: BlogCatalog (10,312 nodes, 39 different labels); Protein-Protein Interaction (PPI) (3,890 nodes, 50 different labels); and Wikipedia (4,777 nodes, 40 different labels). The baseline algorithms tested against are:

• DeepWalk
• Spectral Clustering of the normalized graph Laplacian (Tang and Liu, 2011)
• LINE

With the right selection of p and q, node2vec outperforms all other contenders. The performance gain for BlogCatalog and Wikipedia is a staggering 22%.

In addition to the classification task, the authors also experiment with a link prediction task using the various edge-feature operators described above. Three datasets are tested: Facebook (4,039 nodes; 88,234 edges);
PPI (19,706 nodes; 390,633 edges); and arXiv (18,722 nodes; 198,110 edges). They compare node2vec's performance against the baseline algorithms using the different operators in Table 3, as well as against the standard link-prediction scores: Common Neighbors, Jaccard's coefficient, Adamic-Adar score, and Preferential Attachment. The node2vec method with the Hadamard operator outperforms all the other methods, in some cases with impressive improvements.

struc2vec

Another interesting network embedding work is struc2vec (Ribeiro et al., 2017), which focuses on the roles of nodes in a network. Nodes in networks have specific roles, and these roles can be identified through structural identity. For example, in an airport network, some nodes serve as hubs; two hubs can be far away from each other (hop-wise) but still be structurally similar. In social networks, the roles of users can also be identified by the structure. For example, in a graph representing a company structure, mid-level managers have a typical structure within the graph. Consider the graph in Fig. 5. The red and green nodes are structurally equivalent: there is an isomorphism that maps one to the other. The blue and black nodes, while not structurally equivalent, are structurally similar.

[Figure 5: Node roles in graphs]

struc2vec is a framework for representations based on structural similarity. The goal of struc2vec is to preserve the identity of the nodes' structure when projecting them into Euclidean space, even if they are not close in the graph. Embeddings such as DeepWalk and node2vec capture neighborhood relations. However, two nodes that are structurally similar but very distant will not be close in the vector space. In addition, nodes that are close to each other in the graph can be structurally dissimilar, and thus should not be close in the Euclidean space. This is the problem that struc2vec is trying to solve. In other words, the embedding done by the former methods depends on the hop distance between nodes; struc2vec does not take this distance into account.

To compute the representations, struc2vec builds a special graph, the context graph, that represents structural similarities between nodes. The goal is to create a context graph where nodes are close to each other if they are structurally similar. Once the context graph is constructed, the embedding is done once again using the word2vec algorithm. That is, random walks are performed in the context graph in order to build “sentences”, followed by online learning using word2vec's Skip-gram algorithm. Thus, the main contribution of struc2vec is the construction of the context graph.

Given the original graph G(V, E) with diameter K, the context graph M is a multi-layer graph with K + 1 layers. Each layer includes all the nodes in G. Within each layer, weighted edges represent structural similarity between nodes. Edges also exist between the corresponding nodes of adjacent layers. We will now describe the four steps used in struc2vec. Our emphasis is on the construction of the context graph, since this is the main contribution of the paper.

The first step looks at the structural similarity between nodes. For each node v, we look at N_k(v), the set of nodes which are at distance k from v (with N_0(v) = {v}). These sets form “rings” around each node. For each such ring, we look at the ordered degree sequence DS(N_k(v)) of the nodes participating in the ring.

Referring again to Fig. 5, let us focus, for example, on the black node. When k = 0, N_0(black) = {black}, and its degree is 4, thus DS(N_0(black)) = (4).
Moving to k = 1, we look at the nodes whose distance to black is 1. We see that N_1(black) = {red, blue, green, yellow}, and their degrees are 4, 3, 4, and 1, respectively. Thus, the ordered degree list is DS(N_1(black)) = (4, 4, 3, 1).

Once we compute DS(N_k(v)) (k = 0, 1, ..., K) for all v, we can measure the structural similarities between every pair of nodes u, v by comparing their degree sequences. To measure the distance between two degree sequences A = DS(N_k(u)) and B = DS(N_k(v)), noting that these sequences are not necessarily of the same length, the authors suggest using Dynamic Time Warping (DTW), a technique that matches the elements of two sequences of different lengths and minimizes the sum of the distances between matched elements a ∈ A and b ∈ B. The distance between two elements is designed in struc2vec so that node degrees of 1001 and 1002 are more similar than node degrees of 1 and 2, by using

$$\frac{\max(a, b)}{\min(a, b)} - 1$$

as the distance between two matched degrees in the sequences (see the code sketch below).

Now that the distance between two degree sequences is defined, the structural distance between two nodes is defined hierarchically as:

$$f_k(u, v) = f_{k-1}(u, v) + g\big(DS(N_k(u)), DS(N_k(v))\big)$$

where g is the distance function between two degree sequences described above.

Now that we have defined a structural distance between nodes, we can move on to build the context graph. The context graph M is a multi-layer graph with K + 1 layers. Each layer is a complete graph consisting of all the nodes u ∈ V in the original graph G. Thus, each node u ∈ V is represented in M by K + 1 nodes u_k (k = 0, 1, ..., K), one in each layer.

To illustrate, Fig. 6 shows the first three layers of the context graph M for our example graph G. Each of the nodes in the original graph G is represented in each of the layers in M. The dashed arrows show how the red node is replicated in each layer; other nodes are similarly replicated. Within the k-th layer, the weight of an edge between nodes u_k and v_k is a function of the structural distance f_k(u, v). Thus, each layer is a complete sub-graph with weighted, undirected edges that correspond to the structural similarity between the nodes. Assuming u_k and v_k in layer k are the nodes representing u and v, the weight of their connecting edge is calculated as follows:

$$w(u_k, v_k) = e^{-f_k(u, v)}, \quad k = 0, 1, \ldots, K$$

Nodes that are structurally similar will have larger weights within the multiple layers of M.

Edges also exist between the different layers, but they are directed and exist only between the nodes u_0, ..., u_K that represent the same node u in G. Each node is linked to the corresponding node in the layer just above (for k < K) and just below (for k > 0). These directed edges are also weighted. The weights between the copies of the same node in different layers are given by:

$$w(u_k, u_{k+1}) = \log(\Gamma_k(u) + e), \quad k = 0, \ldots, K - 1 \qquad (12)$$

$$w(u_k, u_{k-1}) = 1, \quad k = 1, \ldots, K \qquad (13)$$

where Γ_k(u) is the number of edges connected to u_k whose weight is larger than the average edge weight in layer k, and is thus a measure of the number of nodes similar to u in layer k. If u_k has many similar nodes, the weight going to the upper level will be larger. In a higher level, the number of similar nodes can only decrease.

Fig. 6 shows these up/down links between the various representations of the red node in the multi-layer graph; similar links exist for the other nodes, but they are not shown in the figure.

[Figure 6: The first three layers of the multi-layer context graph M]
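Before moving on to the walks over M, here is a minimal sketch of the structural-distance computation described above (my own illustration, not the authors' optimized implementation):

```python
import networkx as nx

def degree_sequence(G, v, k):
    """DS(N_k(v)): ordered degree sequence of the nodes at exactly
    hop-distance k from v (descending, as in the running example)."""
    dist = nx.single_source_shortest_path_length(G, v, cutoff=k)
    ring = [u for u, d in dist.items() if d == k]
    return sorted((G.degree(u) for u in ring), reverse=True)

def dtw(A, B, cost):
    """Plain O(|A||B|) dynamic-time-warping distance."""
    INF = float("inf")
    D = [[INF] * (len(B) + 1) for _ in range(len(A) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            c = cost(A[i - 1], B[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[-1][-1]

# struc2vec's element-wise cost: max(a,b)/min(a,b) - 1, so that degrees
# 1001 and 1002 are much closer than degrees 1 and 2.
cost = lambda a, b: max(a, b) / min(a, b) - 1

def structural_distance(G, u, v, K):
    """f_K(u, v): the recursion f_k = f_{k-1} + g(DS_k(u), DS_k(v))
    unrolled; defined only while both k-rings are non-empty."""
    total = 0.0
    for k in range(K + 1):
        A, B = degree_sequence(G, u, k), degree_sequence(G, v, k)
        if not A or not B:
            break
        total += dtw(A, B, cost)
    return total
```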
Once the context graph M is constructed, the rest of the procedure is similar to DeepWalk. Random walks are generated in M to create the contexts of nodes. A hyper-parameter q determines whether the walk changes layers or stays within the current layer, and the weights of the edges determine the probabilities of advancing to the next nodes. With probability q the walk stays within the same layer, and in that case the probability of moving from u to v is given by

$$p_k(u, v) = \frac{e^{-f_k(u, v)}}{Z_k(u)} \qquad (14)$$

where Z_k(u) is a normalization factor. Thus, in each step, the walk will prefer to move to nodes which are structurally similar. With probability 1 − q, the walk will move up or down a layer, to the corresponding node in layer k − 1 or k + 1, according to the weights given in (12) and (13).

In struc2vec the Skip-gram approach is used by generating sets of independent and relatively short random walks in M. Multiple random walks are generated for each node, starting with layer 0. These are the “sentences”. The sentences are then used as input to the word2vec algorithm to train a neural network and learn the latent representations of the nodes by maximizing the probability of nodes within a context.

M is a huge graph: it has (K + 1)n nodes and (K + 1)·n(n − 1)/2 + 2nK edges. (The paper has a typo in these numbers, which I verified by corresponding with one of the authors (Savarese). In Section 3.2 of the paper, the number of nodes in M is listed as Kn and the number of edges as K·n(n − 1)/2 + 2n(K − 1); the authors mistakenly use K layers instead of K + 1 layers.) To reduce the time needed to generate and store the multi-layer graph and the contexts for the nodes, the authors propose three optimizations:
• Reduce the length of the degree sequences by compressing them into a list of 2-tuples (d, count), where each tuple means that the sequence has count nodes of degree d.
• Reduce the number of edges in the multi-layer graph (only log(n) neighbors per node).
• Reduce the number of layers in the multi-layer graph.

These optimizations enable the struc2vec algorithm to scale quasi-linearly, and the authors were able to analyze networks with millions of nodes.

[Figure 7: Mirrored Karate network (from Ribeiro et al. (2017))]

The authors ran different experiments which demonstrated the superiority of struc2vec for the task. In one of the experiments, they created a graph composed of two copies of Zachary's Karate network (Zachary, 1977), connected by a single edge, as shown in Fig. 7. The result of the struc2vec embedding was compared to
DeepWalk and node2vec. A result of the embedding is shown in Fig. 8. Since the Karate network was duplicated, the resulting mirrored network has two identical nodes for each role. These were captured by struc2vec, as can be seen by the pairs of nodes embedded closely in part (c) of the figure. The top embeddings, for (a)
DeepWalk and (b) node2vec, do not capture the structural equivalence and mostly focus on the graph distances.

[Figure 8: Comparison of node representations (from Ribeiro et al. (2017))]

metapath2vec

Finally, we look into recent work by Dong et al. (2017) which embeds heterogeneous networks. In a heterogeneous graph, nodes can represent different entities. A classic example is an academic network, where nodes can represent researchers, organizations, papers, and conference venues, and edges represent various relations between the entities. For example, a paper is connected to its author(s), and also to the conference venue where it was presented. Fig. 9 shows a mock academic network.

[Figure 9: An example heterogeneous academic network, with four types of nodes, from left to right: Organizations (O), Authors (A), Papers (P), and Conference Venues (V)]

The methods we have seen so far (
DeepWalk, LINE, word2vec) assume a homogeneous network, where there is only one kind of node (e.g., a person in a social network). The contexts generated for the nodes, and the resulting embedding, do not take into account the different types of nodes and the relationships between them. To solve this problem, the authors suggest a method called metapath2vec, where the random walks are biased by using meta-paths. A meta-path is a predefined composite relation between nodes. For example, in the context of an academic network, the specific relation Author-Paper-Author defines the notion of co-authorship: two authors that are connected in this way are co-authors of the same paper (note that authors can also be connected via an organization, but that implies a different relationship). The idea of the meta-path is not new, but here the authors use it to create random walks, thus generating carefully biased contexts for nodes.

This work again uses the Skip-gram algorithm, with the necessary adjustments for heterogeneous networks. We are now looking to optimize

$$\arg\max_\theta \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta) \qquad (15)$$

where N_t(v) are the neighbors of v of the t-th type, and p(c_t | v; θ) is defined as the softmax function.

As mentioned, the random walks must follow the meta-paths, which are hand-designed for the specific network and task. Examples of meta-paths for the academic network are given in Fig. 10. For example, A-P-A denotes the Author-Paper-Author path. Each specific meta-path semantic thus creates a bias toward specific relations. Some meta-paths can be long, for example O-A-P-V-P-A-O (Organization-Author-Paper-Venue-Paper-Author-Organization). These meta-paths are then used to create biased random walks; i.e., the random walks must follow the semantics dictated by the various prescribed meta-paths. For example, for A-P-A, the walk will start with an author, then choose a paper (at random), then another author (at random); a sketch of such a meta-path constrained walk appears at the end of this section. Again, these contexts are fed into a Skip-gram-like neural network for the final embedding.

[Figure 10: Meta-paths for the academic network]

The authors compared the metapath2vec performance to
DeepWalk, LINE, and node2vec, as well as to PTE (Predictive Text Embedding). The data used for the task consisted of two academic networks, AMiner Computer Science and DBIS (Database and Information Systems). AMiner contains more than 9 million authors and 3 million papers from thousands of venues; DBIS is much smaller.

metapath2vec shows impressive results on tasks involving heterogeneous networks, including visualization, classification, and clustering. In Fig. 11, PCA is used to project the embeddings of venues and top authors produced by the various methods.

[Figure 11: PCA projections of the 128D embeddings of 16 top CS conferences and corresponding high-profile authors (Dong et al., 2017)]

The embeddings by
DeepWalk (a) and PTE (b) clustered the authors and venues, but failed to create a meaningful relation between them. metapath2vec and metapath2vec++ (a variant of metapath2vec) show a consistent relationship between each author and their field.

Fig. 12 demonstrates how the relations between nodes of a certain type also benefit from metapath2vec, as the projection of venues naturally lends itself to conferences in the same field being embedded close to each other. Performance in a multi-class classification task was also superior compared to the baseline methods.

[Figure 12: 2D t-SNE projections of the 128D embeddings of 48 CS venues, three each from 16 sub-fields (Dong et al., 2017)]
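As referenced earlier, here is a minimal sketch of a meta-path constrained random walk (my own illustration; it assumes a networkx graph whose nodes carry a 'type' attribute such as "A", "P", "V", or "O"):

```python
import random
import networkx as nx

def metapath_walk(G, start, metapath=("A", "P", "A"), length=20):
    """Random walk constrained by a meta-path such as A-P-A.
    Each step moves to a uniformly chosen neighbor whose 'type'
    attribute matches the next type in the (cyclic) pattern; the
    first and last types of the meta-path are assumed equal."""
    walk = [start]
    step = 0
    while len(walk) < length:
        step += 1
        wanted = metapath[step % (len(metapath) - 1)]
        candidates = [u for u in G.neighbors(walk[-1])
                      if G.nodes[u]["type"] == wanted]
        if not candidates:   # dead end: no neighbor of the wanted type
            break
        walk.append(random.choice(candidates))
    return walk
```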
Summary

This tutorial described methods of network embedding that are based on research in word embeddings.
DeepWalk , LINE, node2vec , etc. are able toprocess very large-scale graphs. Online learning of the latent vector representations of nodes areshown to have superior performance in various tasks, including multiple-label classification andedge prediction. Network embedding is an appropriate target for machine learning, since there area multitude of underlying patterns in the graphs that are non-trivial to detect programmatically.A major strength of these network embeddings is the ability to use well-developed data miningand other statistical algorithm (e.g., classification, prediction) for performing network tasks, insteadof running discrete, path-, node-, or edge-based graph algorithms. The computational optimizationare also an important advantage, as these methods scale to millions of nodes. Some of the issueswhich still need to be improved upon are the selection of hyper-parameters, and, perhaps most Curiously, the Perozzi et al. (2014) paper never mentions the term network embedding or graph embedding ,instead focusing on online learning and representation . The Grover and Leskovec (2016) paper mentions nodeembeddings only once. This perhaps hints at disconnect between different research communities. word2vec (2013) → Deep-Walk (2014) → LINE (2015) → node2vec (2016), also demonstrates how quickly ideas propagatewithin the machine learning community, perhaps a testament to the success of the short-cycleconferences publication ecosystem. I also note that the code, data, and datasets for both the DeepWalk and node2vec papers are available online. Thus, the methods and results can betested, compared, reproduced, and improved upon. I see this as an essential asset for advancingresearch in this field. 18 eferences
Dong, Y., Chawla, N. V., and Swami, A. (2017). metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144. ACM.

Firth, J. (1957). Papers in Linguistics, 1934–1951. Oxford University Press.

Fouss, F., Saerens, M., and Shimbo, M. (2016). Algorithms and Models for Network Data and Link Analysis. Cambridge University Press.

Goldberg, Y. and Levy, O. (2014). word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM.

Macskassy, S. A. and Provost, F. (2003). A simple relational classifier. Technical report, New York University, Stern School of Business.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mnih, A. and Salakhutdinov, R. R. (2008). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264.

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252.

Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R. (2017). struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 385–394. ACM.

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. (2015). LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee.

Tang, L. and Liu, H. (2009a). Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817–826. ACM.

Tang, L. and Liu, H. (2009b). Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1107–1116. ACM.

Tang, L. and Liu, H. (2011). Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478.

Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473.