Evaluating Node Embeddings of Complex Networks

A Preprint

Arash Dehghan-Kooshkghazi, Bogumił Kamiński, Łukasz Kraiński, Paweł Prałat, François Théberge

February 17, 2021

Abstract
Graph embedding is a transformation of nodes of a graph into a set of vectors. A good embedding should capture the graph topology, node-to-node relationships, and other relevant information about the graph, its subgraphs, and nodes. If these objectives are achieved, an embedding is a meaningful, understandable, compressed representation of a network that can be used by other machine learning tools such as node classification, community detection, or link prediction. The main challenge is that one needs to make sure that embeddings describe the properties of the graphs well. As a result, selecting the best embedding is a challenging task and very often requires domain experts.

In this paper, we carry out a series of extensive experiments with selected graph embedding algorithms, both on real-world networks and on artificially generated ones. Based on those experiments we formulate two general conclusions. First, if one needs to pick one embedding algorithm before running the experiments, then node2vec is the best choice, as it performed best in our tests. Having said that, there is no single winner in all tests and, additionally, most embedding algorithms have hyperparameters that should be tuned and are randomized. Therefore, our main recommendation for practitioners is, if possible, to generate several embeddings for the problem at hand and then use a general framework that provides a tool for unsupervised graph embedding comparison. This framework (introduced recently in the literature and easily available in a GitHub repository) assigns a “divergence score” to embeddings to help distinguish good ones from bad ones.

Keywords: embedding, complex networks

The goal of many machine learning applications is to make predictions or discover new patterns using graph-structured data as feature information.
For example, one might want to better understand the role of a particular researcher within a collaboration network, the similarity between users interacting on Amazon or Yelp, to classify proteins in a biological interaction network, or to recommend new users or products to users of some social media. As a result, the study of networks has emerged in diverse disciplines as a means of analyzing these complex relational data. Capturing aspects of a complex system as a graph can bring physical insights and predictive power [1, 2].

Network Geometry is a rapidly developing approach in Network Science [3] which further abstracts the system by modelling the nodes of the network as points in a geometric space. There are many successful examples of this approach that include latent space models [4], and connections between geometry and network clustering and community structure [5, 6]. Very often, these geometric embeddings naturally correspond to physical space, such as when modelling

∗ Department of Mathematics, Ryerson University, Toronto, ON, Canada; e-mail: [email protected]
† Decision Analysis and Support Unit, SGH Warsaw School of Economics, Warsaw, Poland; e-mail: [email protected]
‡ Decision Analysis and Support Unit, SGH Warsaw School of Economics, Warsaw, Poland; e-mail: [email protected]
§ Department of Mathematics, Ryerson University, Toronto, ON, Canada; e-mail: [email protected]
¶ Tutte Institute for Mathematics and Computing, Ottawa, ON, Canada; e-mail: [email protected]
wireless networks or when networks are embedded in some geographic space [7, 8]. See [9] for more details about applying spatial graphs to model complex networks.

In order to extract useful structural information from graphs, one might want to try to embed them in a geometric space by assigning coordinates to each node such that nearby nodes are more likely to share an edge than those far from each other. In particular, in the case of link prediction, a good embedding should have the property that most of the network’s edges can be predicted from the coordinates of the nodes. On the other hand, in the case of node classification, one might want to include information about the global position of a node in the graph or the structure of the node’s local graph neighbourhood. Other applications might require different properties to be preserved. As a result, there are many embedding algorithms (based on techniques from linear algebra, random walks, or deep learning) and the list constantly grows. Moreover, many of these algorithms have various parameters that can be carefully tuned to generate embeddings in some multidimensional spaces, possibly in different dimensions. Hence, unfortunately, in the absence of a general-purpose representation for graphs, graph embedding very often requires domain experts to craft features or to use specialized feature selection algorithms.

The main questions we try to answer in this research project are:
How do we evaluate these embeddings? Which one is the best and should be used for a given task at hand? In order for algorithms for a specific application to perform well, it is crucial to feed them with an appropriate embedding. Indeed, the well-known concept in data science and machine learning, “garbage in, garbage out” (GIGO), says that flawed or nonsense input data produces nonsense output. Hence, the impact of this project propagates to many important machine learning problems.

In order to answer these questions, we perform a detailed study of a number of popular graph embedding algorithms: node2vec, VERSE, LINE, DeepWalk, HOPE, and SDNE (see Section 2). We evaluate embeddings using a general framework that assigns a “divergence score” to each embedding which, in an unsupervised learning fashion, distinguishes good from bad embeddings (see Section 3). We start with experiments on real-world networks (see Section 4), but in order to understand how basic statistics affect the quality of embeddings we also perform a series of tests on synthetic graphs generated by the ABCD model, similar to the well-known and widely used LFR benchmark (see Section 5). In particular, the summary of our experiments and conclusions for practitioners can be found in Subsection 5.8.

Finally, let us mention that evaluating embedding algorithms is a subjective task. Our experiments are based on the “divergence score” that proved to be a useful tool in a number of applied projects we were personally involved with, but clearly we are biased. In order to convince readers without prior experience with the framework, we finish the paper with a few experiments to show that there is a strong correlation between the “divergence score” and the quality of the selected machine learning tools that use embeddings as an input: classification, community detection, and link prediction (see Section 6). We hope that after reading the paper the conclusion will be apparent: it is best to generate a few embeddings (possibly using various algorithms and suitably tuned parameters) and then use the benchmarking framework to select the best candidate. This approach is especially recommended in unsupervised learning contexts, for example, anomaly or community detection. For supervised machine learning tools, one may alternatively directly evaluate the quality of candidate embeddings for the given task at hand. However, in the latter scenario the learning step is often computationally expensive; the framework may then be used to narrow the set of embeddings down to a smaller family of potentially useful candidates.
There are over 100 algorithms proposed in the literature for node embeddings, based on various approaches such as random walks, linear algebra, and deep learning [10]. Moreover, many of these algorithms have various parameters that can be carefully tuned to generate embeddings in some multidimensional spaces, possibly in different dimensions. For our experiments, we selected 6 popular algorithms that span different families. All but one of them (VERSE) are taken from the OpenNE framework.

The first two algorithms, DeepWalk [11] and node2vec [12], are based on random walks performed on the graph. This approach was successfully used in Natural Language Processing (NLP); for example, the Word2Vec algorithm [13] is based on the assumption that “words are known by the company they keep”. For a given word, an embedding is obtained by looking at words appearing close to each other as defined by context windows (groups of consecutive words). For graphs, the nodes play the role of words and “sentences” are constructed via random walks. The exact procedure for performing such random walks differs between the two algorithms we selected.

https://github.com/xgfs/verse/
https://github.com/thunlp/OpenNE
In the DeepWalk algorithm, the family of walks is sampled by performing random walks on graph G, typically between 32 and 64 walks per node, each of some fixed length. The walks are then used as sentences. For each node v_i, the algorithm tries to find an embedding e_i of v_i that maximizes the approximated likelihood of observing the nodes in its context windows obtained from the generated walks, assuming independence of observations. We set all parameters to their default values, namely, number of walks: 10, walk length: 80, workers: 8, window size: 10.

In node2vec, biased random walks are defined via two main parameters. The return parameter (p) controls the likelihood of immediately revisiting a node in the random walk. Setting it to a high value ensures that we are less likely to sample an already-visited node in the following two steps. The in-out parameter (q) allows the search to differentiate between inward and outward nodes, so we can smoothly interpolate between breadth-first search (BFS) and depth-first search (DFS) exploration. We set all parameters to their default values, namely, number of walks: 10, walk length: 80, workers: 8, window size: 10, p: 1, q: 1.

There are several deep learning methods successfully used for embedding nodes in a graph. One of those, Structural Deep Network Embedding (SDNE) [14], is an autoencoder, a type of artificial neural network that is a commonly used deep learning model for representing complex objects such as images. The goal is to represent objects of interest in a lower dimension in such a way that the original object can be reconstructed as well as possible from its low-dimensional vector representation. Autoencoders are trained to minimize reconstruction errors (such as squared errors), often referred to as the loss function.
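The random-walk step shared by DeepWalk and node2vec can be sketched as follows. This is a minimal illustration using uniform walks (i.e., the DeepWalk setting, which node2vec recovers with p = q = 1); the real implementations feed the walks and context windows into a Word2Vec-style skip-gram model rather than stopping at context pairs, and the function names here are ours, not the OpenNE API.

```python
import random

def random_walks(adj, num_walks=10, walk_length=80, seed=0):
    """Generate DeepWalk-style uniform random walks.

    adj: dict mapping node -> list of neighbours.
    Returns a list of walks; each walk is a list of nodes (a "sentence").
    """
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def context_pairs(walks, window=10):
    """(node, context-node) pairs, as fed to a Word2Vec-style model."""
    pairs = []
    for walk in walks:
        for i, v in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    pairs.append((v, walk[j]))
    return pairs

# Toy graph: a triangle plus a pendant node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(adj, num_walks=2, walk_length=5)
pairs = context_pairs(walks, window=2)
```

With 2 passes over 4 start nodes, this yields 8 walks; every consecutive pair of nodes in a walk is an edge of the graph.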
SDNE aims at preserving both the first- and the second-order proximity: first-order proximity is derived directly from the weights of the edges, while second-order proximity indicates similarity between nodes’ neighbourhoods. We changed the number of neurons at each encoder layer from its default value of 1000 to 128, as the default was consistently producing very poor results. The remaining parameters were set to their default values, namely, alpha: 1e-6, beta: 5, nu1 (l1-loss of weights in the autoencoder): 1e-5, nu2 (l2-loss of weights in the autoencoder): 1e-4, batch size: 200, learning rate: 0.01.

Several embedding algorithms are based on linear algebra. The High Order Proximity preserved Embedding algorithm (HOPE) [15] is aimed at embedding nodes of directed graphs, but can also be used for undirected graphs. For every node v_i, we define two embeddings, e_{s,i} and e_{t,i}, the source and, respectively, the target embedding. Let E_s and E_t be the corresponding matrices of the source and the target embeddings. The loss function for HOPE, for a given proximity matrix S, is defined as follows:

Φ(E_s, E_t) = ||S − E_s^T E_t||_F,

where ||·||_F is the Frobenius norm, a natural and straightforward extension of the Euclidean norm to matrices. There are several choices for the proximity matrix S, such as Katz similarity, common neighbours, or Adamic-Adar. We used the default proximity measure, common neighbours.

The next algorithm, Large-scale Information Network Embedding (LINE) [16], is an efficient method for node embedding which explicitly defines two functions to encode the first- and the second-order proximity. In order to capture the first-order proximity, a joint probability distribution is defined for a pair of nodes based on their embeddings. The method is similar for the second-order proximity. In this case, each node v_i is assigned source and target embedding vectors, e_{s,i} and e_{t,i}, and the conditional probability distribution is considered for the target of a random edge sampled from the set of edges having one endpoint in v_i. We set all parameters to their default values, namely, batch size: 1000, epochs: 10, negative ratio: 5, order: 3, label file: no labels used, CLF ratio: 0.5, auto save: true.

Finally, VERtex Similarity Embeddings (VERSE) [17] is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected node-to-node similarity measure. It is a general framework that learns any similarity measure among nodes via training a simple, yet expressive, single-layer neural network.
This includes popular similarity measures such as personalized PageRank, SimRank, and adjacency similarity. We used the default proximity measure, personalized PageRank. We also set all remaining parameters to their default values, namely, alpha: 0.85, learning rate: 0.0025, threads: 4, nsamples: 3.
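As a concrete illustration of the HOPE objective: by the Eckart–Young theorem, the Frobenius loss above is minimized by a truncated SVD of the proximity matrix. The following is a minimal sketch under our own conventions (rows of E_s and E_t are the embeddings, so the reconstruction is E_s E_t^T; the proximity is the common-neighbour count); `hope_embedding` is our illustrative name, not the OpenNE API.

```python
import numpy as np

def hope_embedding(A, d):
    """HOPE-style embedding via a truncated SVD of a proximity matrix.

    A: dense (n, n) adjacency matrix of a simple graph.
    Proximity used here: S = A @ A, so S[i, j] counts common neighbours.
    Returns (E_s, E_t); row i of each matrix is node i's embedding.
    """
    S = A @ A
    U, sigma, Vt = np.linalg.svd(S)
    sq = np.sqrt(sigma[:d])
    E_s = U[:, :d] * sq      # scale columns by sqrt of singular values
    E_t = Vt[:d, :].T * sq
    return E_s, E_t

# 4-cycle: opposite nodes share exactly two common neighbours,
# and S = A @ A has rank 2, so a 2-dimensional embedding is exact.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
E_s, E_t = hope_embedding(A, d=2)
err = np.linalg.norm(A @ A - E_s @ E_t.T)  # Frobenius reconstruction error
```

Because the proximity matrix of this toy graph has rank 2, the reconstruction error at d = 2 is (numerically) zero; for larger graphs the truncated SVD gives the best possible rank-d approximation in Frobenius norm.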
Evaluating graph embedding algorithms is a challenging task. This subjective process depends on the specific application of the embedding at hand, and typically requires ad-hoc experiments and tests performed by domain experts. However, in the recent papers [18, 19], a “divergence score” was proposed that can be assigned to the outcomes of embedding algorithms to help distinguish good ones from bad ones. This general framework provides a tool for unsupervised graph embedding comparison and is available at the GitHub repository.

https://github.com/KrainskiL/CGE.jl
In order to justify why we use the framework in our experiments and to build some intuition, let us try to answer the following related question: What do we expect from a good embedding? One natural and desired property is to require that, based on a good embedding, one should be able to predict most of the network’s edges from the coordinates of the nodes in the embedded space. One typically expects that if two nodes are far away from each other, then the chance that they are adjacent in the graph is smaller compared to another pair of nodes that are close to each other. But, of course, in any real-world network there are some sporadic long edges, and some nodes that are close to each other are not adjacent. Due to this fact, in the framework we use, an embedding algorithm is not considered good when it only pays attention to local properties such as the existence of particular edges (the microscopic point of view); rather, the expectation is that it is able to capture some global properties, such as the number of edges induced by some relatively large subsets of nodes (the macroscopic point of view). So, how may one evaluate whether the global structure is consistent with our expectations and intuition without considering individual pairs of nodes?

The proposed approach works as follows. First, some dense parts of the graph need to be identified by a good graph clustering algorithm. By default, the framework uses the
Ensemble Clustering algorithm for Graphs (ECG), which is based on the classical Louvain algorithm and the concept of consensus clustering [20]. This algorithm is known to have good stability, but the choice of graph clustering algorithm is flexible, and it was empirically verified that it does not affect the outcome of the process as long as the set of nodes is partitioned into clusters such that substantially more edges are captured within clusters than between them. The clusters that are found provide the desired macroscopic point of view of the graph. Note that for this task only information about the graph G is used; in particular, the embedding is not used at all. We then consider the graph from a different point of view. Using the Geometric Chung-Lu (GCL) model, based on the degree distribution of the graph and the embedding, we compute the expected number of edges within each cluster found earlier, as well as between them. The embedding is scored by computing a divergence score between these expected numbers of edges and the actual numbers of edges present in the graph.

In order to see the framework “in action”, we perform a small experiment with the well-known College Football real-world network with known community structures. This graph represents the schedule of United States football games between Division IA colleges during the regular season in Fall 2000 [21]. The data consists of 115 teams (nodes) and 613 games (edges). The teams are divided into conferences containing 8–12 teams each. In general, games are more frequent between members of the same conference than between members of different conferences, with teams playing an average of about seven intra-conference games and four inter-conference games in the 2000 season. There are a few exceptions to this rule, as detailed in [22]: one of the conferences is really a group of independent teams, one conference is really broken into two groups, and 3 other teams play mainly against teams from other conferences.
We refer to those as outlying nodes, which we represent with a distinctive triangular shape.

In order to illustrate the application of the framework, we ran various embedding algorithms in different dimensions and with various sets of parameters on the Football dataset. In Figure 1, we show the best and worst scoring embeddings based on the divergence score. The colours of nodes correspond to the conferences, and the triangular-shaped nodes correspond to the outlying nodes discussed above. The communities are very clear in the left plot, while in the right plot only a few communities are clearly grouped together.

Figure 1: The College Football Graph: we show the best (left) and the worst (right) scoring embedding.
Uniform Manifold https://github.com/ftheberge/graph-partition-and-measures PREPRINT - F
EBRUARY
17, 2021
Approximation and Projection ( UMAP ) [23], a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. It provides apractical scalable algorithm that applies to real world datasets.Finally, let us make a comment that not all embeddings proposed in the literature try to capture an information aboutedges. Some algorithms indeed try to preserve edges whereas others care about some different structural properties; forexample, they might try to map together nodes with similar functions or role within the network. Having said that, manyimportant applications a data scientist needs to deal with in everyday work require preserving (global) edge densitiesand the framework favours embeddings that do a good job from that perspective. We come back to this discussion inSection 6 and justify using the framework more.
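To make the scoring step of Section 3 concrete, here is a schematic sketch: compare the observed distribution of edges over clusters (with one extra bin for between-cluster edges) against the distribution expected under a model, using Jensen–Shannon divergence. The `expected` vector below is a made-up stand-in for what the Geometric Chung-Lu model would predict from an embedding; the actual CGE framework is more refined (e.g., it optimizes over the GCL model's parameter), so this is an illustration of the idea, not the implementation.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

def edge_fractions(edges, cluster, k):
    """Fraction of edges inside each of k clusters, plus one 'between' bin."""
    counts = [0] * (k + 1)
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        counts[cu if cu == cv else k] += 1
    total = len(edges)
    return [c / total for c in counts]

# Toy example: two clusters of 3 nodes each, mostly internal edges.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
actual = edge_fractions(edges, cluster, k=2)
# Hypothetical expected fractions from a model fitted to some embedding:
expected = [0.40, 0.40, 0.20]
score = jsd(actual, expected)   # lower score = better match with the graph
```

An embedding whose model-implied edge placement matches the observed cluster structure gets a score near zero; an embedding that scatters the clusters gets a large score.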
Let us start with evaluating the performance of the selected graph embedding algorithms run on a few real-world networks. For our experiments, we selected four networks. Before we briefly describe these datasets, let us present a few of their statistics.
Property                        Mouse Brain   Airports   Email-EU    GitHub
Nodes                                  1029        464        986     37700
Edges                                  1700       7595      16017    288996
Density                             0.00321    0.07071    0.03298    0.0041
Maximum degree                          153        175        342      9458
Minimum degree                            1          1          1         1
Average degree                        3.304     32.737     32.489    15.331
Assortativity                        -0.215     -0.055     -0.025    -0.075
Number of triangles                       0     100358     104395    523782
Global clustering coefficient             0      0.476      0.266     0.012
Maximum k-core                            5         50         34        34
Number of components                     20          2          1         1
Diameter                                 12          7          7        11
Average path length                   4.913      2.455      2.588     3.246

Table 1: Some statistics of the four networks we experimented with.

Mouse Brain Graph

This graph represents the mouse brain. Nodes represent regions of the brain and edges represent neuronal fiber tracts that connect one node to another. One interesting feature of this graph (which, in fact, was the main reason to select it for our experiments) is that it contains no triangles, something that is very rare in general. Indeed, many social networks (and, to a lesser degree, other networks) exhibit a relatively large clustering coefficient, which can be described as the overall probability for the network to have adjacent nodes interconnected, thus revealing the existence of tightly connected communities (or clusters, subgroups, cliques). This network has a clustering coefficient equal to zero, and so it will allow us to understand the effect of this graph parameter on the quality of the embedding algorithms.
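Several of the statistics in Table 1 are easy to compute directly from an edge list; the following is a minimal sketch in pure Python (no graph library), with our own function name.

```python
def graph_stats(n, edges):
    """A few of the statistics from Table 1, computed from an edge list
    of a simple undirected graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degrees = [len(adj[v]) for v in range(n)]
    m = len(edges)
    # Each triangle is counted once per edge, hence the division by 3.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    # Global clustering coefficient: 3 * triangles / number of wedges.
    wedges = sum(d * (d - 1) // 2 for d in degrees)
    return {
        "density": 2 * m / (n * (n - 1)),
        "max_degree": max(degrees),
        "avg_degree": 2 * m / n,
        "triangles": triangles,
        "global_clustering": 3 * triangles / wedges if wedges else 0.0,
    }

# Toy check: a triangle with one pendant node (4 nodes, 4 edges).
stats = graph_stats(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

On the toy graph this gives 1 triangle, 5 wedges, and hence a global clustering coefficient of 0.6; applied to the edge lists of the four datasets it reproduces the corresponding rows of Table 1.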
Airports Graph

This graph contains information about flights between airports, based on a record of more than 3.5 million US domestic flights from 1990 to 2009. It has been taken from the OpenFlights website, which has a huge database of different modes of travel across the globe. The nodes are labelled by the 3-letter airport codes; the latitude and the longitude as well as the state and the city are also available. The edges are directed, with weights representing the total volume of passengers between the two airports.

https://github.com/lmcinnes/umap
http://networkrepository.com/bn-mouse-kasthuri-graph-v4.php
https://github.com/ftheberge/GraphMiningNotebooks/tree/master/Datasets/Airports
GitHub Graph

This graph is a large social network of GitHub developers, collected from the public API in June 2019. Nodes correspond to developers who have starred at least 10 repositories, and edges represent mutual follower relationships between them. The node features are extracted based on location, starred repositories, employer, and e-mail address. In particular, the set of nodes was partitioned into web developers and machine learning developers, a feature derived from the job title of each user. As a result, this network is suitable for experiments on binary node classification: one might want to predict whether a GitHub user is a web or a machine learning developer. We ignore this partition in our experiments and work with the entire graph.
Email-EU Graph

The network was generated using email data from a large European research institution. Emails are anonymized, and there is an edge between u and v if person u sent person v at least one email. The dataset does not contain incoming messages from, or outgoing messages to, the rest of the world. More importantly, it contains “ground-truth” community memberships of the nodes, indicating to which of the 42 departments at the research institute individuals belong. As a result, this dataset is suitable for experiments aiming to detect communities, but we ignore this external knowledge in our experiments.

Let us present the general approach that we used to test each of the four networks we selected to experiment with. For a given embedding algorithm A and a given dimension d ∈ {4, 8, 16, 32, 64, 128}, we independently run the algorithm 30 times. This is done not only to measure how good the algorithms are but also their stability; recall that 4 out of the 6 algorithms we test are randomized. To that end, we compute the average divergence score a_{A,d} and the standard deviation s_{A,d}.

The results are presented in Figures 2–5 in the form of heat-maps: for each algorithm A (y axis) and each dimension d (x axis), the corresponding square is presented in a light colour if the divergence score a_{A,d} is small (that is, the embedding scores well according to the benchmark framework), and dark colours are used if the divergence score is large (that is, the embedding does not score well). The same approach is used to visualize the behaviour of the standard deviation s_{A,d}.
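The evaluation protocol above (30 independent runs per algorithm and dimension, then the mean and standard deviation of the divergence scores) can be sketched as follows. The `embed_and_score` callable is a hypothetical stand-in for "embed the graph, then score the embedding with the CGE framework"; it is not a real API.

```python
import random
import statistics

def evaluate(embed_and_score, runs=30, seed=0):
    """Run a (randomized) embedding pipeline `runs` times and report the
    mean and standard deviation of its divergence score."""
    rng = random.Random(seed)
    scores = [embed_and_score(rng) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder pipeline: a noisy score around 0.1, mimicking a stable
# algorithm; a real pipeline would embed the graph and call the scorer.
mean, std = evaluate(lambda rng: 0.1 + rng.uniform(-0.01, 0.01))
```

A low mean indicates a good embedding on average; a low standard deviation indicates a stable (reproducible) algorithm, which is exactly the pair of quantities shown in the heat-maps.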
Figure 2: Mouse Brain Graph: Average and Standard Deviation of the Divergence Score (Heat Map)

The results generally improve when the dimension increases, with the exception of the VERSE algorithm which, surprisingly, seems to perform better in lower dimensions. Dimensions 4 and 8 are too small to capture the “shape” of the networks, which manifests itself via the relatively large divergence scores obtained for such embeddings. In all four networks, node2vec consistently generated embeddings that scored the best, with VERSE (in lower dimensions) taking second place. HOPE and SDNE did not perform very well, which might suggest that such algorithms do not aim to preserve global densities (the property that the benchmarking framework tries to evaluate) but some other aspects of the network.

Figure 3: Airports Graph: Average and Standard Deviation of the Divergence Score (Heat Map)

Figure 4: GitHub Graph: Average and Standard Deviation of the Divergence Score (Heat Map)

Figure 5: Email-EU Graph: Average and Standard Deviation of the Divergence Score (Heat Map)

https://github.com/benedekrozemberczki/MUSAE
https://snap.stanford.edu/data/email-Eu-core-temporal.html

The score of the
DeepWalk algorithm varies a lot from network to network; it is very bad for the Airports Graph but performs well on the Email-EU Graph. With regard to stability, SDNE seems to be the least stable of the algorithms we tested, and node2vec wins one more time: it seems to be not only consistently good but also quite stable.
As commonly done in the literature as well as in the applied world, we analyze how the selected embedding algorithms perform on artificially constructed networks with communities (see, for example, [22]). By doing this we may flexibly change the characteristics of the network (such as its size, the number of clusters, the degree distribution, etc.) and assess the impact of these changes on the results. A popular example of such a network generator is the LFR benchmark proposed by Lancichinetti, Fortunato, and Radicchi [24], which produces synthetic graphs resembling real-world graphs. For our experiments, we use an alternative random graph model, namely, the Artificial Benchmark for Community Detection (ABCD) graph [25]. In both benchmarks, the size of each community is drawn from a power-law distribution, as is the degree of each node. As a result, both benchmarks produce graphs with similar properties. The main reason for using ABCD instead of LFR is that the mixing parameter µ, the main parameter of the LFR model guiding the strength of the communities, has a non-obvious interpretation and so can lead to unnaturally defined networks. Another reason is that ABCD is faster than LFR and can be easily parallelized. Moreover, due to its simplicity, it is possible to analyze it theoretically. Such results, despite often being asymptotic in nature, may shed some light on the behaviour of embedding algorithms on real-world networks. We leave this for future work and concentrate on experiments.
The ABCD model has a number of parameters that can be independently tuned, and so it is suitable for testing which properties of real networks affect the quality of embedding algorithms. In order to do that, we fix all parameters but one and then investigate how sensitive the algorithms are with respect to the selected parameter. Of course, it might be the case that the quality of a given algorithm depends on some specific combination of parameters, but such more subtle correlations are more challenging to detect and so are left for future research. Here are the parameters we want to investigate, as well as their default values.

• Size of the network: n is the number of nodes in the graph. The default value is n = 10,000.

• Degree distribution: γ is the (negative) exponent of the power-law degree distribution. The default value is γ = 2.5. The default degree sequence is generated in advance and used for all experiments that use the default settings of n = 10,000 and ∆ ≈ n^{1/(γ−1)} (the maximum degree). It is generated with γ = 2.5, δ = 5 (the minimum degree), and ∆ = 464 ≈ n^{1/(γ−1)} = n^{1/1.5}. Of course, if n or ∆ changes, then the degree sequence has to be re-generated, but it is then used for all experiments with that choice of parameters n and ∆.

• Maximum degree: ∆ is the maximum degree in the graph. The default value is ∆ ≈ n^{1/(γ−1)}, which corresponds to the so-called natural cut-off. This specific value ensures that the expected number of nodes of degree at least ∆ is close to 1.

• Level of noise: ξ is the mixing parameter that controls the fraction of edges between communities. Essentially, this parameter may be viewed as the amount of noise in the graph. In one extreme case, if ξ = 0, then all the edges are within communities. On the other hand, if ξ = 1, then communities are not present in the graph and edges are simply wired randomly, regardless of the assignment of nodes into communities. The default value for our experiments is ξ = 0.2.

• Community sizes: β is the (negative) exponent of the distribution of community sizes. In order to test the other parameters in a rather easy set-up, instead of generating the sequence of community sizes randomly, by default we simply consider 5 large communities with the following size distribution: 30%, 25%, 20%, 15%, 10%. More importantly, this choice reduces the contribution to the variance from the ABCD model and, as a result, the experiments concentrate on the stability of the embedding algorithms used rather than on the random graph model. We discuss this more in Subsection 5.7.

There are a few other parameters of the ABCD model that we do not investigate, as they should not substantially affect the behaviour of the embedding algorithms. As mentioned above, the minimum degree is set to δ = 5; the minimum and the maximum community sizes are set to 50 and 1000, respectively. The configuration model was used to generate underlying graphs, with the global variant of the model. For more details about the ABCD model we direct the reader to [25].

https://github.com/bkamins/ABCDGraphGenerator.jl
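A quick sanity check of the default values above: with n = 10,000 and γ = 2.5, the natural cut-off n^{1/(γ−1)} is indeed about 464, and under a pure power law with tail P(deg ≥ k) ~ k^{1−γ} the expected number of nodes of degree at least ∆ is then about 1.

```python
# Natural cut-off for a power-law degree sequence with exponent gamma.
n, gamma = 10_000, 2.5
delta_max = n ** (1 / (gamma - 1))          # n^{1/1.5} = n^{2/3} ~ 464

# Expected number of nodes of degree >= delta_max, up to constants:
# n * P(deg >= delta_max) ~ n * delta_max^{1 - gamma}, which equals 1
# exactly when delta_max = n^{1/(gamma - 1)}.
expected_tail = n * delta_max ** (1 - gamma)
```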
Before we move on to the more detailed experiments investigating each parameter of the model independently, let us repeat the experiment we did for the four selected real-world networks in the previous section on a single instance of the ABCD graph. This graph was generated with all parameters of the model set to their default values, except the one controlling the community sizes, which was fixed to β = 1.5. The results of this experiment are presented in Figure 6.

Figure 6: ABCD: Average and Standard Deviation of the Divergence Score (Heat Map)

The conclusion is similar to what we observed in the experiments with real graphs in the previous section. However, embeddings in lower dimensions (4 and 8) seem to be even more challenging than before, and VERSE is not performing as well as before, with DeepWalk taking its second place.
In this section, we present a general approach that is used to test the five parameters mentioned above. Some specific modifications, if necessary, are explained below. In order to test how a given parameter of the ABCD model affects the divergence score, we pick ℓ values of this parameter to test (typically, ℓ = 10) and assign default values to the remaining parameters.

Note that there are two sources of randomness involved in the process: one coming from the graph generation process and the other coming from the embedding algorithm (indeed, 4 out of the 6 algorithms we selected for testing are randomized). In order to investigate which source of randomness affects the divergence score more, in each experiment we generate many independent ABCD graphs: for a given value of the tested parameter x, we generate a family F_x of 10 graphs, independently sampled from the model with the same set of parameters.

For a given algorithm A, a given dimension d ∈ {32, 64, 128}, and a given parameter x, we independently run the algorithm on the 10 graphs from F_x, 10 times for each graph. We compute the average divergence score a_{A,d}(x) and the standard deviation s_{A,d}(x) (over 100 experiments; 10 graphs, 10 embeddings per graph). In order to see how the quality of embeddings changes for a given graph, for each G ∈ F_x, we additionally compute the average score a_{A,d}(x, G) and the standard deviation s_{A,d}(x, G) (over 10 experiments; 10 embeddings of G).

For each experiment, the variance is decomposed into two components related to the two sources of randomness, based on the classical ANOVA method (a statistical test generalizing the t-test). The total sum of squared residuals SS_T can be viewed as the sum of the graph-specific sum of squares SS_G and the embedding-specific sum of squares SS_E. In the tables below, we report SS_T as well as the ratio r_E based on the two decomposition elements, namely r_E = SS_E / SS_T. Clearly, 0 ≤ r_E ≤ 1. Values of r_E close to zero indicate that the noise coming from the graph generation is significantly larger than the noise related to the embeddings, whereas values close to 0.5 imply that both sources of randomness contribute roughly equally to the variance. Let us also note that two of our embedding algorithms (namely, LINE and HOPE) are deterministic and so SS_E = 0 for these algorithms.

Finally, note that SS_T can be alternatively decomposed into ℓ pieces corresponding to the ℓ values of the parameter x tested. A natural question is then to see whether all pieces contribute equally to SS_T or maybe only a few of them contribute in a non-negligible way. To answer this question, we computed the correlation between the corresponding ratio r_E and the tested parameter x; for all five parameters (n, γ, ∆, β, ξ), the correlations turned out to be very small.
The results of our experiments are presented in the following form.

Plot 1. For each embedding algorithm A, we plot 3 functions a_{A,d}(x) for the selected dimensions d ∈ {32, 64, 128} as a function of the tested parameter x. We additionally display confidence bands: a_{A,d}(x) ± s_{A,d}(x).

Plot 2. For each dimension d ∈ {32, 64, 128}, we plot 6 functions a_{A,d}(x), one for each embedding algorithm A, as a function of the tested parameter x. We additionally display confidence bands: a_{A,d}(x) ± s_{A,d}(x).

Plot 3. For each embedding algorithm A, we plot the average s_{A,d}(x, G) (over 10 graphs) for the 3 selected dimensions d ∈ {32, 64, 128} as a function of the tested parameter x.

Plot 4. For each dimension d ∈ {32, 64, 128}, we plot the average s_{A,d}(x, G) (over 10 graphs) for all 6 embedding algorithms A as a function of the tested parameter x.

Plot 5. For each embedding algorithm A, we generate one plot as follows: for all values of the tested parameter x, all dimensions d, and all graphs G ∈ F_x (300 points), we plot (a_{A,d}(x, G), s_{A,d}(x, G)). This way we test whether graphs with large divergence scores produce more variable results.

Now we are ready to discuss the results of the experiments. For convenience and easier comparison, all plots are shown together at the end of this section.

Size of the Network (n)

In this experiment, we study how sensitive the evaluated embedding algorithms are with respect to n, the size of the network. We consider ℓ = 10 different values of the corresponding parameter: n ∈ {1,000, 2,000, . . . , 10,000}. The remaining parameters are set to their default values. (Note that, in particular, the exponent of the power-law degree distribution is set to γ = 2.5 and the maximum degree is set to ∆ ≈ n^{1/(γ−1)} = n^{2/3}. However, since the degree distribution is a function of n, it has to be re-generated for each tested value of n.)
The plots are presented in Figure 7, the decomposition of the variance can be found in Table 2, and the speed of the embedding algorithms is reported in Table 3.

Table 2: Size of the Network (n) — Decomposition of the Variance: r_E / SS_T
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE

Table 3: Size of the Network (n) — the Average Running Time (in Seconds)
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE
32          81        86    89    15     9     53
64          84        97    91    16    10     67
128         92       126    94    18    10     98

The divergence score is rather stable as n increases; that is, it does not drastically change with the size of the network (Plots 1 and 2 in Figure 7). This is, of course, a desired property which indicates that the embeddings capture the "big picture" of the network, focusing on its structure and topology. node2vec and VERSE (in low dimension) perform best whereas
SDNE appears to be the worst one. Let us observe that DeepWalk and LINE (and, to some degree, HOPE) improve their quality as the dimension increases, whereas a large dimension (128), somewhat surprisingly, hurts the remaining algorithms. The stability of the embedding algorithms is also indicated by relatively small standard deviations, with the exception of SDNE, which performs rather poorly from that perspective (Plots 3 and 4). Finally, DeepWalk, node2vec, and VERSE exhibit a positive correlation between the average divergence score and the corresponding standard deviation (Plot 5). Again, somewhat surprisingly, SDNE does not behave as expected and shows no correlation between the average divergence score and its standard deviation.

The decomposition of the variance shows that the contributions from the two sources of randomness are comparable. In terms of speed, SDNE and HOPE are the fastest, followed by VERSE. There is also a slight increase of the running time with respect to the dimension, with the most visible difference for VERSE and DeepWalk.
Degree Distribution (γ)

In this experiment, we investigate the behaviour of the embedding algorithms for different degree distributions. We consider ℓ = 10 different values of the corresponding parameter: γ ∈ {2.1, 2.2, . . . , 3.0}. The remaining parameters are set to their default values. (However, note that the maximum degree ∆ ≈ n^{1/(γ−1)} is a function of γ, which changes in this experiment; in particular, ∆ decreases when γ increases.) The plots are presented in Figure 8, the decomposition of the variance can be found in Table 4, and the speed of the embedding algorithms is reported in Table 5.

Table 4: Degree Distribution (γ) — Decomposition of the Variance: r_E / SS_T
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE

Table 5: Degree Distribution (γ) — the Average Running Time (in Seconds)
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE
32         175       206    90    58    15    102
64         180       236    92    63    16    126
128        202       316    97    70    16    195

Let us first note that the global density of the graph as well as the maximum degree ∆ decrease as γ increases. Hence, it seems natural to expect that the divergence score should get worse (that is, increase) for large values of γ, as sparser random graphs are less "predictable" and thus more challenging to embed (for example, there might be some unusually sparse or dense regions that occur by pure randomness). However, this behaviour is present only for SDNE and
HOPE; in particular, the quality of DeepWalk visibly improves for large values of γ (Plots 1 and 2 in Figure 8). This might mean that DeepWalk has a problem with embedding nodes of large degree. As before, node2vec and VERSE (in low dimension) perform best whereas SDNE appears to be the worst one. We also consistently see the peculiar property that some algorithms (such as VERSE and SDNE) perform worse in higher dimensions. SDNE continues to be unstable, with large values of the standard deviation (Plots 3 and 4) and with no correlation between the average value and the standard deviation (Plot 5).

The decomposition of the variance remains comparable. SDNE continues to be the fastest, HOPE slows down in comparison to the earlier experiment with the graph sizes, and LINE speeds up; however, the order of the algorithms remains the same. Increasing the dimension continues to slow down the algorithms, but no other visible change can be detected; from the complexity point of view, the results are consistent with the ones we discussed in the previous section.

Maximum Degree (∆)

In this experiment, we investigate the maximum degree ∆ = n^x by considering ℓ = 10 different, equally spaced values of the exponent x. The remaining parameters are set to their default values. The plots are presented in Figure 9, the decomposition of the variance can be found in Table 6, and the speed of the embedding algorithms is reported in Table 7.

Table 6: Maximum Degree (∆) — Decomposition of the Variance: r_E / SS_T
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE

Table 7: Maximum Degree (∆) — the Average Running Time (in Seconds)
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE
32         176       209    90    58    14     97
64         181       236    92    62    15    125
128        199       312    98    71    15    207

Note that the global density of the graph increases as ∆ increases; hence, from that perspective, we experience a similar behaviour as with decreasing γ. It therefore seems natural to expect that the divergence score should behave similarly to the previous experiment with the degree distribution modelled by the parameter γ (that is, increasing functions should now be decreasing and vice versa). This behaviour is certainly present for
DeepWalk; as observed before, the quality of DeepWalk visibly drops (especially in low dimension) when large-degree nodes are present, suggesting that they create a problem for this particular embedding algorithm. The other algorithms do not show a similar duality between the two experiments. This implies (perhaps not surprisingly) that the quality of these algorithms cannot be simply deduced from the density of the graph or the maximum degree; it seems that the quality depends in some non-trivial way on the degree distribution (Plots 1 and 2 in Figure 9). The global comparison of the algorithms (ranking, decomposition of the variance, and speed) remains the same as in the previous subsection.

Level of Noise (ξ)

In this experiment, we investigate the level of noise by considering ℓ = 10 different values of the corresponding parameter: ξ ∈ {0.1, 0.2, . . . , 1.0}. The remaining parameters are set to their default values. The plots are presented in Figure 10, the decomposition of the variance can be found in Table 8, and the speed of the embedding algorithms is reported in Table 9.

Table 8: Level of Noise (ξ) — Decomposition of the Variance: r_E / SS_T
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE

Table 9: Level of Noise (ξ) — the Average Running Time (in Seconds)
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE
32         175       211    88    59    15     97
64         178       241    93    63    16    135
128        204       303    99    68    16    191

The level of noise present in the graph is an important aspect for the embedding algorithms, which is confirmed by this experiment. For a low level of noise (modelled by a small value of the parameter ξ), communities are easy to identify and to extract from the graph, and so it should be relatively easy to embed the graph while preserving the community structure. As a result, one would expect all algorithms to score well and have a small divergence score for such values of ξ. On the other hand, for values of ξ close to one, the graph is very close to a random graph with a given degree distribution and no communities. For such graphs, no matter how nodes are embedded in space, the densities between "communities" and within them are going to be very close to the corresponding expected values in the null model. Hence, such graphs should score well again, but the interpretation is different: all algorithms perform predictably badly, given the impossible task of preserving a community structure. In contrast, graphs with values of ξ between zero and one are challenging to embed properly, and so one would expect the divergence score to form an "inverted-v shape" as a function of ξ. node2vec and VERSE exhibit such a shape and, again, these two algorithms win (
VERSE gets worse for some values of ξ). Oddly, SDNE, HOPE, and LINE perform badly for a very low level of noise (Plots 1 and 2 in Figure 10). This could possibly be explained by the overly local nature of such algorithms: in the presence of a low level of noise, local algorithms embed nodes based on knowledge coming almost exclusively from their corresponding communities. As a result, communities are embedded almost independently, and so such algorithms do a poor job of separating them. The behaviour of DeepWalk is even more challenging to explain.

It is worth pointing out that one striking difference between Plots 2 in Figure 10 and the corresponding plots for the other parameters tested is that there is no visible difference between the three dimensions that we evaluated. This
indicates that it is the parameter ξ, rather than the dimension, that drives the divergence score here. The dimension still affects the performance of the algorithms, but its influence is much weaker than that of the parameter ξ tested in this experiment. Let us also note that there is a strong correlation between the average score and the standard deviation (Plot 5), but SDNE continues to be the most unstable (Plots 3 and 4).

The decomposition of the variance remains comparable to the earlier experiments.
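As an aside, the level of noise studied in this section has a direct empirical analogue that can be computed for any graph with a known community assignment; a minimal sketch (plain Python, with a toy edge list purely for illustration):

```python
def mixing_parameter(edges, community):
    """Empirical level of noise: the fraction of edges whose endpoints
    lie in different communities (the observed analogue of xi)."""
    between = sum(1 for u, v in edges if community[u] != community[v])
    return between / len(edges)

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(mixing_parameter(edges, community))  # 1 of 7 edges crosses, about 0.143
```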
SDNE continues to be the fastest and DeepWalk was the slowest. Increasing the dimension continues to slow down the algorithms, but no other visible change can be detected. Let us also mention that during this experiment we were forced to switch to more powerful machines, as HOPE with dimension 128 ran out of memory; this indicates that a large level of noise is challenging not only from the quality-of-embedding point of view but also from the computational one.

Community Sizes (β)

In this experiment, we investigate the distribution of community sizes by considering ℓ = 10 different values of the corresponding parameter: β ∈ {1.1, 1.2, . . . , 2.0}. The remaining parameters are set to their default values. The plots are presented in Figure 11, the decomposition of the variance can be found in Table 11, and the speed of the embedding algorithms is reported in Table 12.

Table 10: Distribution of Communities as a Function of Parameter β
β                    1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0
No. communities       34   38   39   42   48   44   56   60   65   64
Min community size    50   50   50   50   50   50   50   50   50   50
Max community size   983  977  979  968  977  997  978  939  972  988

Table 11: Community Sizes (β) — Decomposition of the Variance: r_E / SS_T
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE

Table 12: Community Sizes (β) — the Average Running Time (in Seconds)
Dim   node2vec  DeepWalk  LINE  HOPE  SDNE  VERSE
32         169       217    87    60    14     99
64         182       239    91    62    15    134
128        206       297   102    68    16    188

Note that the number of communities increases (and so the average community size decreases) when β increases; see Table 10. Since a graph with a large number of small communities is difficult to embed, one would expect the divergence score to get worse for large values of β. All algorithms we tested confirm this intuition. DeepWalk, node2vec, and VERSE perform better in smaller dimensions, which suggests that a low dimension should be the choice for graphs with a large number of communities. The divergence score for the remaining three algorithms is not affected by the choice of the dimension.
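For reference, community sizes governed by the exponent β and truncated to [50, 1000] (the bounds used in our setting) can be sampled along the following lines. This is a plain-Python sketch for illustration, not the ABCD implementation; in particular, the final size is simply trimmed to make the total match n:

```python
import random

def community_sizes(n, beta, c_min=50, c_max=1000, seed=7):
    """Draw community sizes from P(c) proportional to c**(-beta) on
    [c_min, c_max] until they cover all n nodes; trim the overshoot."""
    rng = random.Random(seed)
    support = list(range(c_min, c_max + 1))
    weights = [c ** (-beta) for c in support]
    sizes = []
    while sum(sizes) < n:
        sizes.append(rng.choices(support, weights=weights, k=1)[0])
    sizes[-1] -= sum(sizes) - n   # trim so the sizes sum exactly to n
    return sizes

sizes = community_sizes(10_000, 1.5)
print(len(sizes), sum(sizes))
```

Larger β puts more weight on small sizes, so the number of communities grows with β, which is exactly the trend visible in Table 10.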
VERSE, which used to perform well before, gets worse, DeepWalk improves, but node2vec is still winning (Plots 1 and 2 in Figure 11). SDNE continues to be unstable but is still the fastest, with LINE and HOPE being roughly 5 times slower; DeepWalk is the slowest. As expected, increasing the dimension continues to slow the algorithms down.

The most drastic difference is in the decomposition of the variance. In all earlier experiments, we reported large values of the ratio r_E. This time, r_E is smaller by an order of magnitude. As mentioned earlier when we introduced this ratio, this indicates that the main contribution to the variance comes from the randomness of the graph generation process, not from the randomized embedding algorithms. This is expected, as the distribution of the community sizes is the most fragile parameter of the ABCD model (in fact, of any model, including
LFR). Indeed, with non-negligible probability, the number of communities might be substantially different for two independently generated graphs with the same set of parameters. That was the main reason why we fixed the 5 community sizes instead of generating one sequence of community sizes with, say, β = 1.5 and keeping it for all experiments. This way we balanced the two sources of randomness, the one from the graph generation process and the one from the embedding algorithms themselves.

In this section, we tested the 5 most important parameters of the
ABCD model with the hope of better understanding the influence of basic statistics of real-world networks on the quality of embedding algorithms, as measured by the divergence score. In general, the distribution of community sizes, controlled by the parameter β in the ABCD model, has the single largest impact on the results. Hence, we decided to fix 5 community sizes throughout, except when testing β itself; in that experiment, there are many more communities and we observe that the divergence score increases with β. The impact is not as clear and strong when the size of the network (n) or the degree distribution (γ and ∆) changes.

node2vec was constantly winning in all the experiments we performed. VERSE was good, but not in the presence of many small communities or a large level of noise, and only in low dimensions. These observations are consistent with the experiments with real graphs we presented earlier. SDNE seems to be the most unpredictable algorithm, but that may be because it aims to capture different aspects of the graph, such as the role of particular nodes within the network, instead of preserving the densities between communities.

The conclusion is that the node2vec algorithm seems to be a good first choice in general but, depending on the graph that needs to be embedded, other algorithms may give similarly good results. Moreover, node2vec has a number of parameters itself which could potentially produce even better outcomes. We investigate this aspect in the next set of experiments. In general, there is no clear winner (a specific algorithm with a specific set of parameters) that always works best, and so there is a need to guide the choice of the algorithm and its parameters by a trustworthy, unsupervised benchmark framework.

The execution time of all algorithms is sensitive to the dimension. However, the increase of the computing time is significantly larger for node2vec, DeepWalk, and VERSE than for LINE, HOPE, and SDNE. node2vec and DeepWalk are the most computationally intensive; however, they support parallel execution and their run-time can be controlled by the number of available CPU cores. SDNE is consistently the fastest of all tested algorithms, roughly 20 times faster than DeepWalk.

In these final experiments on the synthetic graphs, we focus on node2vec, the embedding algorithm that seems to perform best both for the real-world networks we experimented with and for the synthetic graphs generated by the
ABCD model. The goal is to investigate how the behaviour of node2vec changes for various parameters of this algorithm. Recall that the return parameter p controls the likelihood of 2-hop redundancy in the corresponding random walk: large values of this parameter decrease the probability that an already-visited node is sampled in the following two steps. On the other hand, small values of the in-out parameter q encourage DFS-like exploration, whereas large values can be used to emulate BFS-like exploration.

In our first experiment, we independently generated ABCD graphs using the default set of parameters, family F. For a given dimension d ∈ {4, 8, 16, 32, 64, 128} and a given set of parameters (p, q) of node2vec, we independently ran the algorithm on 5 graphs from F, 5 times for each graph. We computed the average divergence score a_{node2vec,d}(p, q) and the standard deviation s_{node2vec,d}(p, q) (over 25 experiments; 5 graphs, 5 embeddings per graph). The results are presented in Figure 12 in the following form.

Plot 6. We plot 5 functions a_{node2vec,d}(p, q) for the selected dimensions d ∈ {8, 16, 32, 64, 128} with the return parameter fixed to p = 1, as a function of the parameter q ∈ {1/9, 1/7, 1/5, 1/3, 1, 3, 5, 7, 9}. We additionally display confidence bands: a_{node2vec,d}(p, q) ± s_{node2vec,d}(p, q).

Plot 7. We plot 5 functions a_{node2vec,d}(p, q) for the selected dimensions d ∈ {8, 16, 32, 64, 128} with the in-out parameter fixed to q = 1, as a function of the parameter p ∈ {1/9, 1/7, 1/5, 1/3, 1, 3, 5, 7, 9}. We additionally display confidence bands: a_{node2vec,d}(p, q) ± s_{node2vec,d}(p, q).

In the second experiment, we investigate the influence of the second set of parameters: the number of walks, which has the default value of k = 10, and the walk length, which is set to w = 80 by default. The results are presented in Figure 13 in the following form.

Plot 8. We plot 5 functions a_{node2vec,d}(k, w) for the selected dimensions d ∈ {8, 16, 32, 64, 128} with the number of walks fixed to k = 10, as a function of the walk length w. We additionally display confidence bands: a_{node2vec,d}(k, w) ± s_{node2vec,d}(k, w).
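To make the roles of p and q concrete, the following sketch (plain Python; a simplified, unweighted version of the second-order bias from the node2vec paper, not code from our experiments) computes the unnormalized transition weights of one walk step:

```python
def step_weights(graph, prev, curr, p, q):
    """Unnormalized node2vec transition weights out of `curr`, given that
    the walk arrived from `prev` (graph: dict node -> set of neighbours).
    The bias is 1/p for stepping back to prev, 1 for neighbours of prev
    (distance 1), and 1/q for nodes at distance 2 from prev."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # return step; large p discourages 2-hop redundancy
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stays close to prev (BFS-like if q > 1)
        else:
            weights[nxt] = 1.0 / q   # moves outward (DFS-like if q < 1)
    return weights

# Triangle {0, 1, 2} with a pendant node 3 attached to 1.
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
w = step_weights(graph, prev=0, curr=1, p=4, q=0.25)
print(sorted(w.items()))  # [(0, 0.25), (2, 1.0), (3, 4.0)]
```

With p = 4 and q = 1/4, the walk is discouraged from returning to node 0 and strongly encouraged to explore outward to node 3, illustrating the DFS-like regime.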
Figure 7: Size of the Network (n). Plots 1 and 2: a_{A,d}(n) ± s_{A,d}(n); Plots 3 and 4: average s_{A,d}(n, G) (over 10 graphs); Plot 5: correlation between a_{A,d}(n, G) and s_{A,d}(n, G) (Pearson's correlations: DeepWalk 0.48, node2vec 0.60, SDNE −0.03, VERSE 0.86).
Figure 8: Degree Distribution (γ). Plots 1 and 2: a_{A,d}(γ) ± s_{A,d}(γ); Plots 3 and 4: average s_{A,d}(γ, G) (over 10 graphs); Plot 5: correlation between a_{A,d}(γ, G) and s_{A,d}(γ, G) (Pearson's correlations: DeepWalk 0.86, node2vec 0.72, SDNE −0.07, VERSE 0.83).
Figure 9: Maximum Degree (∆). Plots 1 and 2: a_{A,d}(∆) ± s_{A,d}(∆); Plots 3 and 4: average s_{A,d}(∆, G) (over 10 graphs); Plot 5: correlation between a_{A,d}(∆, G) and s_{A,d}(∆, G) (Pearson's correlations: node2vec 0.49, DeepWalk 0.85, SDNE 0.28, VERSE 0.87).
Figure 10: Level of Noise (ξ). Plots 1 and 2: a_{A,d}(ξ) ± s_{A,d}(ξ); Plots 3 and 4: average s_{A,d}(ξ, G) (over 10 graphs); Plot 5: correlation between a_{A,d}(ξ, G) and s_{A,d}(ξ, G) (Pearson's correlations: DeepWalk 0.61, node2vec 0.93, SDNE 0.98, VERSE 0.72).
Figure 11: Community Sizes (β). Plots 1 and 2: a_{A,d}(β) ± s_{A,d}(β); Plots 3 and 4: average s_{A,d}(β, G) (over 10 graphs); Plot 5: correlation between a_{A,d}(β, G) and s_{A,d}(β, G) (Pearson's correlations: DeepWalk 0.20, node2vec 0.64, SDNE 0.68, VERSE −0.16).
EBRUARY
[Figure 12 panels: log of divergence score for dimensions 8, 16, 32, 64, and 128, one panel with p fixed at 1.0 and one with q fixed at 1.0, the varied parameter ranging over 1/9, 1/7, 1/5, 1/3, 1, 3, 5, 7, 9.]

Figure 12: node2vec: the influence of the return parameter p and the in-out parameter q (plots 6 and 7).

Plot 9. We plot 5 functions a_{node2vec,d}(k, w) for the selected dimensions d ∈ {8, 16, 32, 64, 128}, with the walk length fixed to w = 80, as a function of the number of walks k ∈ {5, 10, 20}. We additionally display confidence bands: a_{node2vec,d}(k, w) ± s_{node2vec,d}(k, w).
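The parameters p and q studied here control node2vec's biased second-order random walk. As a minimal, self-contained sketch (not the authors' implementation), the walk assigns unnormalized weight 1/p to returning to the previous node, 1 to neighbours of the previous node, and 1/q to nodes farther out:

```python
import random
from collections import defaultdict

def node2vec_walk(adj, start, walk_length, p=1.0, q=1.0, rng=random):
    """One biased second-order random walk in the style of node2vec.

    adj: dict mapping node -> set of neighbour nodes.
    p: return parameter; q: in-out parameter.
    """
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        # Unnormalized transition weights alpha_{p,q}(prev, x).
        weights = []
        for x in nbrs:
            if x == prev:            # distance 0 from prev: return
                weights.append(1.0 / p)
            elif x in adj[prev]:     # distance 1 from prev: stay close
                weights.append(1.0)
            else:                    # distance 2 from prev: move outward
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy graph: a triangle with a pendant path (purely illustrative).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

walk = node2vec_walk(adj, start=0, walk_length=10, p=0.25, q=4.0)
print(walk)
```

Intuitively, a large p discourages revisiting the previous node, a large q keeps the walk near it (BFS-like exploration), and a small q pushes it outward (DFS-like exploration), which is why the ratio between q and p matters in the plots above.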
[Figure 13 panels: divergence score as a function of the walk length (40, 80, 160; number of walks = 10) and of the number of walks (5, 10, 20; walk length = 80), for dimensions 8, 16, 32, 64, and 128.]
Figure 13: node2vec: the influence of the number of walks k and the walk length w (plots 8 and 9).

The results of these experiments are consistently very good. We see a slight improvement of the divergence score when the ratio between q and p increases. On the other hand, the number of walks and the walk length seem to be even more stable, with no visible improvement. One surprising observation is that dimension 8 gave the best result, better than dimension 128. The remaining dimensions behave as expected: the divergence score improves as the dimension increases.

Finally, let us stress that these experiments were performed only on the ABCD model with a given set of default parameters defined at the beginning of this section (in particular, in the presence of a relatively low level of noise), but the fact that the results are stable and very good is another reason to use node2vec as a default embedding algorithm.

. . . But Can One Trust the Framework?
In all experiments performed in this paper, we measured the quality of embedding algorithms by computing the divergence score returned by the benchmarking framework. The divergence score measures to what degree the following natural and desired properties are satisfied. Embeddings that score well extract enough information from the graph to allow one to reconstruct the number of edges between communities as well as within them. In particular, pairs of nodes that are close in the embedded space tend to be adjacent and vice versa; there are some sporadic long edges, but they are not common. (See Section 3 for a longer discussion.)

However, despite the fact that the definition of the divergence score seems natural, the following important questions arise:
Should one trust the divergence score in making decisions about whether or not to use a given outcome of the embedding algorithm? What if the framework favours embeddings that perform poorly when fed as input to some machine learning algorithm? In order to answer these questions, we highlight a few of the most common applications of graph embedding algorithms and show that the performance of the corresponding tools highly depends on the divergence score. Of course, the list is not intended to be complete and there are many other important potential applications one might want to explore.
Node classification is an example of a semi-supervised learning problem where labels are only available for a small fraction of nodes, and the goal is to label the remaining nodes based on this small initial seed set. This situation is often observed in, for example, social networks, in which labels might indicate a user's interests, beliefs, or demographic characteristics. There could be many reasons for labels not to be available for a large fraction of nodes; for example, a user's demographic information might be withheld to protect their privacy. Our task is then to infer the missing labels based on the small set of labeled nodes and the structure of the graph.

Since embedding algorithms can be viewed as the process of extracting features of the nodes from the structure of the graph, one may reduce the problem to a classical machine learning classification problem on the set of vectors. There are many algorithms, such as logistic regression, k-nearest neighbours, decision trees, XGBoost, and support vector machines, for any potential scenario one might be interested in, including binary, multi-class, and multi-label classification.

For our experiment, we used the synthetic ABCD graph with n = 10, and all parameters set to their default values (in particular, the configuration model with the global variant was used) except the parameter β, which was set to β = 1. . The reason for deviating from the default 5 communities is to introduce a more challenging scenario for the classifier, with a larger number of communities (42 communities were generated by the random model). Indeed, the community of each node of this graph is its ground-truth community provided by the ABCD model, and the goal of the classifier is to predict it.

The set of nodes was randomly partitioned into a training set (with 75% of the nodes) and a test set (with the remaining 25%).
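The split-train-evaluate procedure can be sketched as follows. Everything here is an illustrative assumption rather than the paper's exact pipeline: the "embedding" is a synthetic matrix with planted communities, and a simple nearest-centroid classifier stands in for the gradient-boosted model used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: a d-dimensional "embedding" of n nodes with
# planted community labels, in place of a real ABCD graph embedding.
n, d, k = 400, 16, 4                      # nodes, dimension, communities
labels = rng.integers(0, k, size=n)       # hypothetical ground truth
centers = rng.normal(size=(k, d))
X = centers[labels] + 0.3 * rng.normal(size=(n, d))

# Random 75% / 25% train/test split, as in the experiment.
perm = rng.permutation(n)
cut = int(0.75 * n)
train, test = perm[:cut], perm[cut:]

# Nearest-centroid classifier: a deliberately simple stand-in for the
# Gradient Boosted Trees (XGBoost) model used in the paper.
centroids = np.stack([X[train][labels[train] == c].mean(axis=0)
                      for c in range(k)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == labels[test]).mean()  # fraction of correct predictions
print(f"accuracy = {accuracy:.3f}")
```

Any classifier that consumes fixed-length feature vectors can be dropped in at the prediction step; the point is only that embeddings turn a graph problem into a standard tabular one.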
For each embedding algorithm A and each dimension d ∈ {4, 8, 16, 32, 64, 128}, we used the labels from the training set and their embeddings to train a Gradient Boosted Trees (XGBoost) model. This model was used as it is integrated with a number of packages, making it easy to use, and the model itself has some additional nice features that distinguish it from other gradient boosting algorithms. Having said that, let us stress that our goal in this experiment is not to classify nodes as well as possible but rather to detect a possible correlation between the quality of a classifier and the divergence score of the embeddings used. Similar conclusions can be derived using other models, which might or might not give better results. Following the same argument,
XGBoost was used with default hyperparameter values, and no tuning was done as part of the experiment. We applied the model to the embeddings of the nodes from the test set to predict the labels of the corresponding nodes. Based on the ground truth, we then computed the overall accuracy (the fraction of predictions our model got right).

For a given algorithm A and a given dimension d, we repeated the above procedure 10 times, independently and randomly splitting the set of nodes into training and test sets. The average accuracy depends on the choice of A and d. More importantly, there seems to be a strong correlation between the accuracy and the quality of the embedding as judged by the framework's divergence score (see Figure 14, left). Indeed, the correlation coefficient is equal to − . , which shows that there is a significant correlation between the two metrics, but the relation seems to be non-linear. In particular, embeddings with a low divergence score achieve high accuracy, whereas embeddings with a high divergence score exhibit varying accuracy. In other words, a large divergence score does not imply that the performance of the node classification algorithm is going to be poor, but a low divergence score seems to guarantee success. We also noticed that, in general, the dimension cannot be too small (4 or 8), but the average accuracy quickly stabilizes and there is no need to use embeddings in very large dimensions (see Figure 14, right).

There are various techniques and algorithms for detecting communities in networks. Node embeddings provide an alternative tool for clustering related nodes, or they may be used to tune and improve the graph tools by providing additional, complementary information. Indeed, since each node is associated with a real-valued vector embedded in d-dimensional space, one may alternatively ignore the initial graph and apply some generic clustering algorithm to the set of associated vectors.
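As a minimal illustration of this alternative, one can cluster the embedding vectors directly. The data below is hypothetical (well-separated planted communities), and Lloyd's algorithm is written out by hand instead of calling a library:

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with greedy farthest-point initialization."""
    centroids = [X[0]]
    for _ in range(k - 1):
        # Next centre: the point farthest from all centres chosen so far.
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: recompute centroids (keep old one if cluster empty).
        centroids = np.stack([X[assign == c].mean(axis=0)
                              if np.any(assign == c) else centroids[c]
                              for c in range(k)])
    return assign

# Hypothetical embedding vectors: 3 well-separated communities in 8 dims.
rng = np.random.default_rng(1)
true = np.repeat(np.arange(3), 100)
X = 4 * rng.normal(size=(3, 8))[true] + rng.normal(size=(300, 8))

assign = kmeans(X, k=3)
print("cluster sizes:", np.bincount(assign))
```

In practice one would simply call a library implementation; the sketch only shows that, once nodes become vectors, community detection reduces to ordinary point clustering.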
Clustering points seems to be a much easier task and is a well-studied area of research with many scalable algorithms, such as k-means or DBSCAN, that are readily available.
Figure 14: Node classification: relation between the accuracy and the divergence score (left) and the dimension of the embedding (right).

For our experiment, we used the synthetic ABCD graph with n = 10, and all parameters set to their default values. For a given algorithm A and a given dimension d ∈ {4, 8, 16, 32, 64, 128}, we independently ran the k-means algorithm 10 times with the correct number of clusters (namely, k = 5). As we discussed in the previous experiment, our goal is to investigate whether there is a correlation between the divergence score and the quality of the embedding; we do not aim to detect communities as well as possible. Hence, we use the "vanilla" k-means algorithm instead of more advanced tools such as, for example, DBSCAN. In order to compare the results with the ground-truth community structure generated by the ABCD model, we compute the average
Adjusted Mutual Information (AMI) score, a widely used measure based on information theory. As before, there is a correlation between the average AMI score and the average divergence score (see Figure 15, left). The correlation between the two measures is equal to − . . There seems to be one outlier with a very low AMI score, HOPE; the correlation without this algorithm is − . . The quality of LINE increases with the dimension, whereas SDNE does the opposite (see Figure 15, right). The remaining four algorithms are insensitive to the dimension in that respect.

Figure 15: Community Detection: relation between the AMI score and the divergence score (left) and the dimension of the embedding (right).
Node embeddings can be successfully used to predict missing links or to predict links that are likely to form in the future. Indeed, networks are often constructed from observed interactions between nodes, which may be incomplete or inaccurate. In particular, the situation of missing links is typical in the analysis of biological networks, in which verifying the existence of links between nodes requires experiments that are expensive and might not be accurate. Moreover, a task closely related to link prediction is the main ingredient of recommendation systems. The goal might be to predict missing friendship links in social networks or to recommend new friends. Another task might be to predict new links between users and products that they may like.

Once nodes are embedded in d-dimensional space, one may use the distance between the corresponding vectors to make the prediction. Nodes that are close to each other in the embedded space but are not adjacent might get connected in the near future, as they seem to be similar to each other. On the other hand, since the networks we typically mine are dynamic, one might be interested in predicting which links will become inactive; for example, which users on Instagram a given user might want to unfollow in the near future. A natural guess would be to pick nodes that are far apart in the embedded space, as the distance indicates that the nodes are dissimilar.

As before, for our experiment we used the synthetic ABCD graph G with n = 10, and all parameters set to their default values. This time, we randomly select 10% of the edges of G, forming a set E, and remove them, thus creating a new graph G′ = G \ E. Then we take another random sample of non-adjacent pairs of nodes in G, forming a set E′. Both classes have the same number of pairs of nodes, so the test set created in this way is balanced. Our goal is to train a model that uses the embedding of graph G′ to detect which pairs of nodes in E ∪ E′ are adjacent in G.

For a given algorithm A and a given dimension d ∈ {4, 8, 16, 32, 64, 128}, we find the embedding of graph G′. A natural strategy would be to consider all pairs of adjacent nodes in G′ (the positive class) as well as a random subset of pairs of non-adjacent nodes (the negative class) and use some standard tool, such as a logistic regression model, on such a training set. Such tools combine the embeddings of two nodes into a feature vector to be used for prediction. The output is an estimation of the probability of the positive class in the training data set.
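That natural strategy can be sketched as follows. Everything in this snippet is an illustrative assumption rather than the paper's setup: a synthetic two-community "embedding", the Hadamard product as one common way to combine two node embeddings into an edge feature vector, and a hand-rolled gradient-descent logistic regression in place of a library model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical embedding: two communities, each node placed near its
# community centre; "edges" join nodes of community 0, "non-edges" cross.
n, d = 200, 8
comm = np.repeat([0, 1], n // 2)
Z = np.where(comm[:, None] == 0, -2.0, 2.0) + rng.normal(size=(n, d))

def pair_features(u, v):
    # Hadamard product: turns two node embeddings into one edge feature.
    return Z[u] * Z[v]

pos = rng.integers(0, n // 2, size=(300, 2))                 # within community 0
neg = np.column_stack([rng.integers(0, n // 2, size=300),
                       rng.integers(n // 2, n, size=300)])   # across communities
X = np.array([pair_features(u, v) for u, v in np.vstack([pos, neg])])
y = np.concatenate([np.ones(300), np.zeros(300)])

# Plain logistic regression trained by batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(positive class)
    g = p - y                               # gradient of log-loss wrt logits
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

train_acc = ((p > 0.5) == y).mean()
print(f"training accuracy = {train_acc:.3f}")
```

Other pairwise combinations (concatenation, average, absolute difference) work equally well as edge features; the Hadamard product is simply a popular choice.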
However, in the spirit of the two earlier experiments, we keep it simple and compute the L2-distance between the pairs of selected edges and non-edges from E ∪ E′. Based on this, we compute the class membership score using the following simple formula: for each uv ∈ E ∪ E′,

p(uv) = 1 − d(u, v) / d_max,

where d(u, v) is the L2-distance between nodes u and v, and d_max = max_{uv ∈ E ∪ E′} d(u, v) is the maximum distance over the entire set. In particular, as desired, nodes that are close to each other are predicted to be adjacent with high probability, and nodes that are far away from each other are most likely non-adjacent. In order to measure the quality of this simple model, we use the Area Under the ROC Curve (AUC), which provides a measure of separability, as it tells us how capable the model is of distinguishing between the two classes. Indeed,
AUC is bounded from above by 1 and can be interpreted as the probability that a random positive observation has a higher predicted probability than a random negative observation.

We repeat the above procedure 10 times for each algorithm A and dimension d, each time independently selecting 10% of the edges of G to form graph G′. We compute the average AUC score as well as the average divergence score. As in the previous experiments, there is a strong correlation between the AUC score and the divergence score: the correlation coefficient is equal to − . . Indeed, Figure 16 (left) shows that the relation is not linear, but there is a clear trend, as expected and as desired. Surprisingly, some of the embedding algorithms achieve lower AUC scores in higher dimensions, but the difference is significant only for VERSE (see Figure 16, right). However, note that our simple classifier relies exclusively on distances, while embeddings in higher dimensions may capture other features that might be used by more sophisticated classifiers. In any case, these results confirm that more useful information is kept by embeddings that score well under the benchmarking framework.

To further support this conclusion, we investigated two embeddings that scored well by the framework (node2vec, d = 4 and node2vec, d = 128) and two that were ranked as the worst ones (HOPE, d = 4 and SDNE, d = 128); see Table 13. For each of them, we plot the distribution of lengths independently for E and E′ (see Figures 17 and 18). Good embeddings keep adjacent nodes close to each other, whereas bad embeddings actually do the opposite.

Table 13: Performance metrics for distance-based link prediction.

Embedding         | AUC | Accuracy | Divergence
node2vec, d = 4   |     |          |
node2vec, d = 128 |     |          |
HOPE, d = 4       |     |          |
SDNE, d = 128     |     |          |
Figure 16: Link Prediction: relation between the AUC score and the divergence score (left) and the dimension of the embedding (right).

Figure 17: Two of the worst embeddings according to the divergence score: HOPE, d = 4 and SDNE, d = 128 (distributions of L2 norms for E and E′).

Figure 18: Two of the best embeddings according to the divergence score: node2vec, d = 4 and node2vec, d = 128 (distributions of L2 norms for E and E′).
. . . Yes! One Can Trust the Framework!
Node embedding is an important tool for extracting useful information from graphs. There are many excellent algorithms proposed in the literature, but the quality of their outcomes depends on the structure of the network that one aims to process. As a group of researchers and practitioners who often use embedding algorithms, in this project we aimed to investigate various algorithms (built using different techniques) in order to make better and more informed choices about which ones to use. The conclusion we converged to is to use node2vec as the default choice: this algorithm consistently works well for both real-world networks and synthetically generated ones. Having said that, the other algorithms were often at least comparable if not slightly better, but the competitors change from experiment to experiment. Moreover, each algorithm (including node2vec) has a number of parameters one can tune, the dimension being only one of them.

In light of this unclear best choice of algorithm and parameters, we recommend using the benchmarking framework to make that decision in an unsupervised way, without manually inspecting the quality of the generated embeddings. In order to support this recommendation, we performed a number of experiments in which we applied some classical tools to important machine learning tasks and measured whether there is a correlation between the divergence score returned by the benchmarking framework and the quality of the tool that uses the embeddings to guide the process. A strong correlation between the two measures supports the recommended approach.

Having said that, there are some natural follow-up questions that can be asked and experiments to be performed. Let us mention two of them. First, we measured how the divergence score depends on some simple statistics of the graph, such as the level of noise or the degree distribution. In order to achieve this, we experimented with the
ABCD graph, in which we fixed all but one parameter and varied the one that was not fixed. To get a more detailed understanding of the effect on the divergence score, it would be interesting to see how a combination of two parameters affects the quality of the embeddings. The second experiment that we would like to suggest is related to the applications of node embeddings. In this paper we tested three natural applications (node classification, community detection, and link prediction). Another important application is anomaly detection. It would be interesting to see how the quality of various anomaly detection algorithms depends on the divergence score.
Experiments were conducted using SOSCIP Cloud infrastructure. Launched in 2012, the SOSCIP consortium is a collaboration between Ontario's research-intensive post-secondary institutions and small- and medium-sized enterprises (SMEs) across the province. Working together with its partners, SOSCIP is driving the uptake of AI and data science solutions and enabling the development of a knowledge-based and innovative economy in Ontario by supporting technical skill development and delivering high-quality outcomes.
SOSCIP supports industrial-academic collaborative research projects through partnership-building services and access to leading-edge advanced computing platforms, fuelling innovation across every sector of Ontario's economy.

For our experiments, we used Compute G4-x8 (8 vCPUs, 32 GB RAM) machines running the Ubuntu 18.04 operating system. Computation used for experimentation and calibration of the scripts took approximately 20,000 vCPU-hours. The scripts and results can be found in the GitHub repository: https://github.com/arash-dehghan/EmbeddingComplexNetworks