Node Proximity Is All You Need: Unified Structural and Positional Node and Graph Embedding
Jing Zhu ∗† Xingyu Lu ∗† Mark Heimann ‡ Danai Koutra ∗

∗ Computer Science & Engineering, University of Michigan. Email: {jingzhuu, luxingyu, mheimann, dkoutra}@umich.edu
† Authors contributed equally to this work.
‡ Lawrence Livermore National Laboratory. Work partially completed while a student at the University of Michigan.

Abstract
While most network embedding techniques model the relative positions of nodes in a network, recently there has been significant interest in structural embeddings that model node role equivalences, irrespective of their distances to any specific nodes. We present PhUSION, a proximity-based unified framework for computing structural and positional node embeddings, which leverages well-established methods for calculating node proximity scores. Clarifying a point of contention in the literature, we show which step of PhUSION produces the different kinds of embeddings and which steps can be used by both. Moreover, by aggregating the PhUSION node embeddings, we obtain graph-level features that model information lost by previous graph feature learning and kernel methods. In a comprehensive empirical study with over 10 datasets, 4 tasks, and 35 methods, we systematically reveal successful design choices for node- and graph-level machine learning with embeddings.
1 Introduction

Node embeddings model node similarities in a multi-dimensional feature space: the more similar two nodes are in a network, the closer they lie in this space. Two broad categories of node similarity are prevalent in the literature: (i) positional proximity, which embeds close nodes similarly [1]; and (ii) structural similarity, which embeds nodes similarly if they have similar roles or patterns of interaction with other nodes, irrespective of their relative locations [2]. In turn, these similarities lead to positional or proximity-preserving embeddings, and structural or role-based embeddings, respectively.

Characterizing the relationship between proximity-preserving and structural node embeddings is an open and contested problem, with recent works making opposing claims. For instance, Rossi et al. characterize these classes of methods as fundamentally different, both methodologically and in terms of applications [3]. Meanwhile, concurrent work proposed a theoretical framework in which the analogous concepts are actually equivalent for downstream tasks [4]. However, according to [3], it is unclear how this theoretical framework maps onto real-world graph mining methods.

A seminal work, NetMF [5], showed that various positional node embeddings amount to the same embedding technique (matrix factorization) applied to various matrices capturing pairwise node proximity scores. Going further, we propose PhUSION, a proximity-based unified framework for computing structural and positional node embeddings. PhUSION has three steps: (i) computation of pairwise node proximities, (ii) application of a nonlinear filter, and (iii) application of a dimensionality-reducing embedding function. We show which steps can be used for proximity-preserving or structural embedding and which step makes them different, revealing the similarities and differences between the two classes of methods.

Additionally, PhUSION generalizes existing methods and yields novel ones from 35 different combinations of design choices, some of which improve on the variations studied in the literature. We perform an extensive empirical study of possible design choices for both structural and proximity-preserving node embeddings, to understand what works and why. In particular, nonlinear filtering has very recently been identified [6] as a key ingredient in the success of proximity-preserving node embedding. We analyze this observation in much greater detail for proximity-preserving embeddings and, for the first time, apply it to structural embeddings.

We extend PhUSION to embed entire graphs, a problem for which separate solutions have been proposed using graph signatures and similarity scores derived from node proximity matrices [7, 8] and aggregated structural node embeddings [9]. Since we have shown that node proximity matrices can be used to derive structural node embeddings, we interpret previous methods [7, 8] as embedding aggregation; we use PhUSION to learn more expressive graph features by
aggregating our more informative node embeddings, which model information that, as we show, previous works cannot.

Our contributions are summarized as follows:

• Unifying Embedding Perspective: We propose PhUSION, which can use a pairwise node proximity matrix to generate embeddings that model node similarity based on structural roles or positional proximity. Our analysis of PhUSION shows the technical similarities and differences between structural and proximity-preserving node embeddings, a contested open question [3, 4].
• Study of Successful Design Choices: On benchmark tasks for proximity-preserving and structural embedding, we investigate combinations of node proximity matrices, nonlinear transformations, and embedding functions. Our results uncover new insights that can improve both proximity-preserving and structural embeddings.
• Graph-Level Learning: We turn PhUSION into a method for learning features for entire networks from their node proximity matrices, based on node embedding aggregation. We interpret previous graph kernels [8] and feature learning methods [7] as simplified versions of PhUSION, and show what information we can capture with more expressive design choices that these previous works cannot.

We provide code and additional supplementary material at https://github.com/GemsLab/PhUSION.

2 Related Work

Node embeddings are latent feature vectors for nodes in a network that are similar for similar nodes. Most embedding methods define node similarity in terms of proximity (e.g., direct or indirect connection) within a single graph. In contrast, structural embedding methods capture a node's structural role independent of its proximity to specific nodes; this independence makes embeddings comparable across distant parts of a graph [10] or separate graphs [11, 9]. Both kinds of embeddings may be obtained using a diverse range of shallow and deep learning methods. For more information, we refer the reader to a survey [1] on proximity-preserving or positional embeddings, and a recent comprehensive empirical study on structural or role-based embeddings [10].

The plethora of node embedding methods has raised interest in finding unifying frameworks for different methods, which can also lead to new technical advances. For example, many proximity-preserving embedding methods were shown to implicitly factorize different proximity-based node similarity matrices; this insight inspired the NetMF method based on explicit matrix factorization [5]. It is known that many (proximity-preserving) node embedding methods can be summarized as a two-step process of node similarity matrix construction and dimensionality reduction [12]. However, PhUSION is the first framework to subsume both proximity-preserving and structural embedding methods. Moreover, in light of recent work [6], we carefully study a third step of applying a nonlinearity before performing dimensionality reduction.
Graph Comparison.
For comparing entire graphs, aggregating node embeddings (as we do) is competitive with deep neural networks, graph kernels, and feature construction [9]. Because a graph's node proximity matrix captures important information, many works have sought to use this within-graph information for cross-graph comparison. A challenge is that nodes in different graphs may not correspond. The feature learning method NetLSD [7] and the graph kernel RetGK [8] solve this problem by only considering node self-similarities, which forgoes directly modeling a node's similarity to other nodes (cross-node similarities). Other graph similarity functions such as DeltaCon [13] model cross-node similarities, but are restricted to graphs defined on the same set of nodes. In contrast, PhUSION can model within-graph cross-node similarities for more expressive, general cross-graph comparison.
3 The PhUSION Framework

In this section, we present the abstract steps of our PhUSION framework for node and graph feature learning, before describing concrete choices in the next section.
Preliminaries.
We consider a graph G with node set V and adjacency matrix A containing the edges between nodes. We learn an n × d matrix Y of d-dimensional node embeddings, where the i-th row Y_i is a feature representation for node i. For ease of reference, we define common quantities for graph learning and node embedding, along with parameters specific to certain node proximity functions, in Tab. 1.

Structural vs. Positional Embeddings. Structural node embedding should learn similar features for automorphically equivalent or near-equivalent nodes [10, 4], even if they are distant from each other in the network. On the other hand, for nodes to have similar positional embeddings, they must be close in the network. Although these are two very different embedding outcomes, the steps we present below can generate either kind of embedding; later, we will show concretely where the difference arises.
Table 1: Symbols and definitions

Standard graph matrices:
  A — Adjacency matrix
  D — Diagonal matrix of node degrees
  L — Unnormalized Laplacian matrix (D − A)
  L+ — Pseudoinverse of L
  R — Random walk transition matrix (D^{-1} A)
  k — Matrix power

PhUSION functions:
  Ψ() — Node proximity function
  σ() — Nonlinear transformation function
  ζ() — Embedding function
  S — Matrix of node proximities, S = Ψ(A)
  S̃ — Matrix of nonlinearly filtered node proximities, S̃ = σ(Ψ(A))
  Y — Matrix of node embeddings, Y = ζ(σ(Ψ(A)))

PPMI [5]:
  vol(G) — Σ_{i,j} A_{ij}
  T — Window size
  b — Parameter for negative sampling

Heat kernel [2, 7]:
  g_s — Filter kernel with scaling parameter s
  Λ — Diagonal matrix of eigenvalues of L
  U — Eigenvectors of L (L = U Λ U^T)

FaBP [14]:
  h_h — sqrt((−c_1 + sqrt(c_1^2 + 4 c_2)) / (8 c_2)), where c_1 = trace(D) + 2 and c_2 = trace(D^2) − 1
  a — 4 h_h^2 / (1 − 4 h_h^2)
  c — 2 h_h / (1 − 4 h_h^2)

PPR [15]:
  β — Decay parameter

PhUSION Steps. For learning node features from a graph with adjacency matrix A, we perform the following three steps:
Step 1: Calculate node proximities S using a function Ψ(A);

Step 2: Filter these proximities via a nonlinearity function, S̃ = σ(S); and

Step 3: Embed the transformed proximities using a dimensionality-reducing function: Y = ζ(S̃).

Our node embedding framework can be precisely summarized by function composition:

(3.1)  Y = ζ(σ(Ψ(A)))

Multiscale Node Embeddings. Many proximity functions can be tuned with scaling parameters to capture more local or global proximity [2, 16]. We can create multiscale embeddings by concatenating embeddings that use the same node proximity function at several different scales:

(3.2)  Y = ||_i Y(s_i) = Y(s_1) || Y(s_2) || ... || Y(s_t),

where the embedding at each individual scale is computed with Eq. (3.1), using the desired scale parameter to compute node proximity: Y(s_i) = ζ(σ(Ψ(A; s_i))).

Graph Feature Aggregation. We can aggregate a graph's node embeddings into a single feature vector that describes the entire graph using a function ρ():

(3.3)  f = ρ(Y)

We now propose concrete function choices for Eqs. (3.1)-(3.3), and characterize general and specific choices.
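Before detailing those choices, the following minimal sketch (in Python with NumPy) shows how the three steps compose in code. The helper names phusion_embed and psi_katz are ours for illustration only, not the reference implementation; psi, sigma, and zeta stand for any of the concrete functions introduced in the next section:

```python
import numpy as np

def phusion_embed(A, psi, sigma, zeta, scales):
    # Eq. (3.2): concatenate per-scale embeddings Y(s) = zeta(sigma(psi(A; s))).
    return np.hstack([zeta(sigma(psi(A, s))) for s in scales])

def psi_katz(A, s):
    # One possible proximity function Psi: a Katz/PPR-style score
    # S = (I - s*A)^{-1} (s*A), where the scale s plays the role of
    # the decay parameter beta (cf. HOPE [15]).
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - s * A, s * A)
```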
4 Concrete Design Choices

Node Proximity Functions Ψ(). The first step of our framework, PhUSION, is to create a matrix of pairwise node proximities S ∈ R^{n×n}. S_ij should be large for nodes that are close in the graph (e.g., neighbors) and small for faraway nodes. Different proximity matrices have been used not only for node embedding but throughout graph mining, including:

• Positive pointwise mutual information (PPMI) [5]: S = (vol(G)/(bT)) (Σ_{r=1}^T R^r) D^{-1}.
• Heat kernel (HK) [2]: S = U g_s(Λ) U^T.
• Belief propagation (FaBP) [14]: S = (I + aD − cA)^{-1}.
• Personalized PageRank (PPR) [15]: S = (I − βA)^{-1}(βA).
• Laplacian pseudoinverse (L+) [6]: S = L+, which approximates the PPMI matrix as the window size T → ∞, up to a low-rank correction term.
• Powers of the adjacency matrix (Adj) [15, 16] or of the random walk transition matrix (RW) [8]: S = A^k or S = R^k.

Nonlinearity σ(). As a preprocessing step before embedding, we can filter the node proximities with a nonlinear function σ(S). Recent work [6] argues that such nonlinearity is largely responsible for the performance gain of recent deep learning-inspired node embedding methods. Thus, we consider the following functions:

• No nonlinearity: σ(S) = S (the Identity function).
• Elementwise logarithm (Log): For proximity-preserving embedding with PPMI, we set σ(S)_ij = log(max{S_ij, 1}) [5]. For other matrices, with values concentrated in [0, 1], we propose to keep more information by only filtering out negative or zero elements: σ(S)_ij = 0 if S_ij ≤ 0, and σ(S)_ij = log(S_ij / min(S_+)) if S_ij > 0, where min(S_+) is the smallest positive element of S.
• Thresholded binarization (Bin-p) [6]: Let a be the p-th percentile of S (the value below which p% of its elements fall). Then σ(S) is defined elementwise as σ(S)_ij = 0 if S_ij ≤ a, and σ(S)_ij = 1 if S_ij > a.

Embedding Functions ζ(). Given a (filtered) similarity matrix S̃, node embeddings learn low-dimensional feature representations using various dimensionality reduction techniques. We represent the embedding process as a function ζ(S̃).

• One way to generate d-dimensional embeddings is by factorizing the node similarity matrix, prototypically with singular value decomposition (SVD) [5]. Based on the rank-d SVD S̃ ≈ U_d Σ_d V_d^T, we can obtain the node embeddings as ζ(S̃) = U_d Σ_d^{1/2}.
• Another way to generate d-dimensional embeddings from an n × n similarity matrix S̃ is characteristic function sampling (CFS). For even dimensionality d, we compute the embedding of each node u by sampling the real and imaginary components of its empirical characteristic function, φ_u(t) = Σ_{v=1}^n exp(i t S̃_vu), evaluated at d/2 evenly spaced landmarks t_1, ..., t_{d/2} between 0 and 100 [2]. CFS is a permutation-invariant function applied row-wise to S̃ that models the distribution of a node's proximity scores [2].
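As an illustration, here is a minimal NumPy sketch of the two nonlinear filters under the definitions above (the function names are hypothetical, not from the reference implementation):

```python
import numpy as np

def sigma_log(S):
    # Log filter for matrices with values concentrated in [0, 1]:
    # zero out non-positive entries and rescale the positives by the
    # smallest positive entry (assumes S has at least one positive entry).
    smin = S[S > 0].min()
    return np.log(np.where(S > 0, S, smin) / smin)  # zeros map to log(1) = 0

def sigma_bin(S, p):
    # Bin-p filter: binarize S at its p-th percentile.
    a = np.percentile(S, p)
    return (S > a).astype(float)
```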
Special Cases. PhUSION generalizes several existing proximity-preserving and structural embedding methods, which we summarize in the following result:
Theorem 4.1.
Special cases of Eq. (3.2) include, but are not limited to: GraphWave [2], NetMF [5], InfiniteWalk [6], HOPE [15], GraRep [16], DNGR [17], and sRDE [18] for signed networks.

Proof.
We give the constructions in App. A.1.
Positional vs. Structural Embeddings. We isolate the embedding function ζ() as the design choice responsible for making PhUSION yield positional or structural embeddings. Concretely, embedding a proximity matrix using SVD produces positional embeddings, while using CFS (or any other permutation-invariant row function) produces structural embeddings. On the other hand, any choice of Ψ() and σ() can yield positional or structural embeddings.
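To make the distinction concrete, here is a minimal NumPy sketch of the two embedding functions ζ(), a simplified reading of the SVD embedding [5] and of GraphWave-style CFS [2] rather than their reference implementations:

```python
import numpy as np

def zeta_svd(S_tilde, d):
    # Positional: rank-d truncated SVD, embeddings U_d * Sigma_d^{1/2}.
    U, s, _ = np.linalg.svd(S_tilde)
    return U[:, :d] * np.sqrt(s[:d])

def zeta_cfs(S_tilde, d):
    # Structural: sample each node's empirical characteristic function
    # phi_u(t) = sum_v exp(i * t * S_tilde[v, u]) at d/2 evenly spaced
    # landmarks, storing real and imaginary parts.
    n = S_tilde.shape[0]
    ts = np.linspace(0, 100, d // 2)
    Y = np.zeros((n, d))
    for u in range(n):
        phi = np.exp(1j * np.outer(ts, S_tilde[:, u])).sum(axis=1)
        Y[u, 0::2], Y[u, 1::2] = phi.real, phi.imag
    return Y
```

Because zeta_cfs depends only on the multiset of proximity values associated with a node, relabeling the remaining nodes leaves its embedding unchanged; this permutation invariance is exactly what Theorem 4.2 below exploits.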
Theorem 4.2. Let connected graphs G_1, G_2 have an isomorphism π: V_1 → V_2, i.e., a bijective mapping between the nodes with A_2 = P A_1 P^T, where the binary matrix P has nonzero elements exactly at the entries (π(i), i) for i ∈ [1, ..., |V_1|]. Define a combined graph G with block-diagonal adjacency matrix A = [A_1, 0; 0, A_2], so that π encodes an automorphism within G. Assume that the node proximity and nonlinearity functions Ψ() and σ() preserve this automorphism: S̃_2 = P S̃_1 P^T, where S̃_i = σ(Ψ(A_i)). Also assume that disconnected nodes have proximity score 0 (unchanged by the nonlinearity), so that S̃ = σ(Ψ(A)) = [S̃_1, 0; 0, S̃_2]. Let Y be the combined embeddings of G, which can be split into embeddings Y^(1) and Y^(2) corresponding respectively to the nodes originally in G_1 and G_2. Then:

1. If Y = SVD(S̃), then Y^(1)_i ≠ Y^(2)_{π(i)}.
2. If Y = CFS(S̃), or more generally any permutation-invariant row function ζ(S̃), then Y^(1)_i = Y^(2)_{π(i)}.

Proof. See supplementary App. A.2.
Note:
Some existing methods learn structural embeddings with implicit or explicit matrix factorization [19, 11], which in PhUSION would produce positional embeddings. The key difference is that these methods do not factorize a pairwise node proximity matrix, but a structural similarity matrix (where disconnected nodes may have a nonzero similarity score). One advantage of PhUSION is that the node proximity matrices it uses are well studied throughout graph mining.
5 Graph-Level Feature Learning

Our PhUSION framework also produces features that describe an entire graph, when we aggregate its nodes' embeddings into a single feature vector. Here, we show that two recent graph kernels and feature maps are in essence special cases of PhUSION.
PhUSION:NetLSD. NetLSD computes graph features from a graph's heat kernel matrix at multiple scales [7]. For scales s_1, ..., s_d, the resulting d-dimensional feature vector has as its i-th entry h(s_i), the trace of the heat kernel matrix at scale s_i. For size invariance, the authors propose normalizing an n-node graph's features by the heat trace of the n-node empty graph, which amounts to multiplying by 1/n. Thus, the exact normalized NetLSD features are: (1/n)[h(s_1), ..., h(s_d)].
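For intuition, these normalized heat-trace features can be sketched in a few lines (assuming NumPy and an eigendecomposition-based computation; an illustration, not NetLSD's official implementation):

```python
import numpy as np

def heat_trace_features(L, scales):
    # h(s) = Tr(exp(-s * L)) = sum_j exp(-s * lambda_j), normalized by n,
    # the heat trace of the n-node empty graph (L is the unnormalized Laplacian).
    lam = np.linalg.eigvalsh(L)
    return np.array([np.exp(-s * lam).sum() / len(lam) for s in scales])
```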
Theorem 5.1. NetLSD (using the heat kernel with empty-graph normalization) is a special case of Eq. (3.3) where
Ψ() computes the graph's heat kernel matrix at multiple scales s as its proximity matrix S, ζ(S) = diag(S), σ() is the identity function, and ρ() averages the embeddings.

Proof. At scale s_k, the one-dimensional node embedding of node i is given by y(s_k)_i = S(s_k)_ii. Thus, for d scales s_1, ..., s_d, the multiscale embedding of node i given by Eq. (3.2) is y_i = [S(s_1)_ii, ..., S(s_d)_ii]. Aggregating these node features into graph features using Eq. (3.3) gives f = (1/n) Σ_i y_i = (1/n)[Σ_i S(s_1)_ii, ..., Σ_i S(s_d)_ii] = (1/n)[Tr(S(s_1)), ..., Tr(S(s_d))]. When S is the heat kernel matrix, each term becomes Tr(S(s_i)) = h(s_i).

PhUSION:RetGK. The scalable graph kernel (RetGK_II) [8], based on approximate feature maps [20], is defined as K(G_1, G_2) = κ(f(G_1), f(G_2)). Without node attributes, f(G) = Σ_{i=1}^n φ(y_i), where the j-th entry of y_i is the return probability of a random walk of length j starting from node i (formally, R^j_ii), and φ is a feature map approximating a vector-valued kernel [20]. It can thus be seen that RetGK has essentially the same form as the other methods:
Theorem 5.2. Without node attributes and with φ and κ both set to the linear kernel, RetGK is a special case of Eq. (3.3) where: for multiple values of the parameter s, Ψ() computes the graph's s-step random walk transition matrix as its proximity matrix S, ζ(S) = diag(S), σ() is the identity function, and ρ() averages the embeddings.

In practice, [8] proposes to set φ to be a random Fourier feature map approximating the Gaussian kernel [20], and κ to be a Gaussian or Laplace kernel, applying the successive embedding trick used for graph kernels [21]. Node attributes may be incorporated by taking the Kronecker product of the attribute vectors with the embeddings [8]. All of these techniques readily apply to any of the other methods we have proposed.
Expressive Graph Comparison with PhUSION. Postprocessing aside, we can interpret RetGK and NetLSD as instances of PhUSION: they average multiscale embeddings learned from different node proximity matrices (HK for NetLSD, RW for RetGK). However, they use a 1-dimensional embedding function that maps each node to its corresponding diagonal element of S. Of course, this simple embedding loses the off-diagonal information in S (namely, the inter-node proximities), which our embeddings capture. To show the greater expressivity of our embeddings Y by a fair comparison, we also use mean pooling for our ρ(Y), although more complex aggregation functions could be used [9].
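Since ρ() is mean pooling over nodes, graph feature extraction is a thin layer on top of the node embedding pipeline. A minimal sketch, reusing the hypothetical phusion_embed helper from the earlier sketch:

```python
def graph_features(A, psi, sigma, zeta, scales):
    # Eq. (3.3) with rho = mean pooling: f = (1/n) * sum_i Y_i.
    Y = phusion_embed(A, psi, sigma, zeta, scales)  # n x (d * len(scales))
    return Y.mean(axis=0)
```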
6 Experiments

To extensively evaluate PhUSION in a variety of contexts, we consider several real datasets for node classification (Tab. 2a) on which positional and structural role-based embeddings have been shown to be most effective (§ 6.1). For the latter, we also use synthetic data exhibiting clear role equivalences, the structure of which we can precisely control [2, 10]. We also evaluate aggregated structural embeddings for graph classification (§ 6.2) on real benchmark datasets (Tab. 2b).

6.1 Node Embedding. First, we evaluate PhUSION on the node classification task with positional and structural node embeddings.
Setup.
We combine 7 node proximity functions Ψ() and 5 different nonlinearities σ() (including Identity). Following our theoretical analysis (§ 4.4), we use SVD to generate positional node embeddings and CFS to generate structural embeddings. In total, the PhUSION framework gives us 35 different node embedding methods of each type, including the positional embedding methods NetMF [5], InfiniteWalk [6], and HOPE [15] and the structural embedding method GraphWave [2] as special cases. We tune hyperparameters with grid search and report the procedure and best parameters in App. B. Interestingly, we find that the best parameters strongly model local node proximity.
We follow the supervised machine learning setup of [19]: we randomly sample 80% of the dataset for training and use the rest for testing. For multi-label prediction, we use the one-vs-rest logistic regression model [5] and evaluate using micro-F1 scores. We report raw results for all 35 positional node embedding methods derived from PhUSION in Fig. 1. Table 3 performs a drilldown on a per-design-choice basis.

Table 2: Real Datasets

(a) Node Classification (Dataset | Nodes | Edges | Labels)
Proximity:
BlogCatalog [5] | 10,312 | 333,983 | Blogger interests (39)
PPI [5] | 3,890 | 76,584 | Biological states (50)
Wikipedia [5] | 4,777 | 184,812 | Part-of-speech tags (40)
Structural:
Brazil [19] | 131 | 1,038
Europe [19] | 399 | 5,995
USA [19] | 1,190 | 13,599

(b) Graph Classification (Dataset | Graphs | Avg. nodes | Labels)
IMDB-M [22] | 1,500 | 13.00 | Collaboration genre (3)
PROTEINS [22] | 1,113 | 39.06 | Protein type (2)
PTC-MR [22] | 344 | 14.29 | Molecular property (2)
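A minimal sketch of this evaluation protocol, assuming scikit-learn (for the multi-label datasets, labels is a binary indicator matrix; the function name is ours):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def evaluate_embeddings(Y, labels, seed=0):
    # 80/20 split, one-vs-rest logistic regression, micro-F1 (setup of [19, 5]).
    Y_tr, Y_te, l_tr, l_te = train_test_split(
        Y, labels, train_size=0.8, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(Y_tr, l_tr)
    return f1_score(l_te, clf.predict(Y_te), average="micro")
```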
Results.
We can see that PPMI does an excellent job, while L+ is also competitive. As for the nonlinearity σ(), our findings support recent work [6] showing that adding nonlinearity is a critical part of outperforming the original spectral embedding approaches: it is almost always beneficial for all proximity matrices. On average, we find that Log does the best; however, Bin-p also performs better than Identity (no nonlinearity), and indeed the best embedding method for two of the three datasets (PPI and Wikipedia) uses binarization.

The use of binarization as the nonlinearity with L+ for proximity was proposed by InfiniteWalk [6], and the use of PPMI node proximities with Log nonlinearity is the NetMF method [5]. Our findings confirm that these recently identified design choices are indeed among the most successful overall. However, new design choices are competitive with them and may warrant further exploration. Moreover, no single choice of nonlinearity function σ() performs best, nor does performance vary monotonically with the sparsity of the resulting matrix (Bin-50 performs better than both Bin-5 and Bin-95). Corroborating [6], deeper characterization of the various choices of σ() and their effects is of continued interest.

Figure 1: Node classification performance (micro-F1 scores) with positional embeddings. Nonlinearity generally helps, but the best nonlinearity function varies across proximity matrices, and the best proximity matrix varies across datasets.

Table 3: Average rank and average/max micro-F1 scores of the different proximity functions Ψ() (PPMI, HK, PPR, FaBP, L+, A, R) and nonlinearity functions σ() (Identity, Log, Bin-5, Bin-50, Bin-95) on all datasets used for positional node embedding (BlogCatalog, PPI, Wikipedia). Design choices used in the existing methods NetMF and InfiniteWalk perform well on average (better than HOPE, which uses various Ψ() functions but no nonlinearity). However, new design combinations are competitive.

Observation 1. (1) Nonlinearity has a complex effect, but it is essential in improving the performance of positional node embedding. (2) Generally, design choices identified by recent works [5, 6] are among the most successful across datasets, but new combinations are often competitive.
Structural Embeddings. We now evaluate the 35 methods we obtain from the PhUSION framework for structural role-based node embedding on two major tasks: node classification and clustering.
Node Classification.
We again perform supervised machine learning to predict node labels from the node embeddings, but in this case on datasets where the labels correspond to nodes' structural roles. We plot the accuracy of each combination of design choices in Fig. 2, and the average rank, mean, and maximum accuracy of each individual design choice in Tab. 4.
Node Clustering.
Following the literature on structural node embedding [2, 10], we also assess our methods using networks that are constructed to manifest distinctive structural roles. Our goal is to cluster nodes with similar structural roles. We follow the dataset construction (cf. App. C) and clustering setup of [2]. These datasets exhibit clear role equivalence (perturbed by noise). For brevity, we only report results from embeddings without nonlinearity. We assess the clustering quality using homogeneity, completeness, and silhouette score.
Results. Node Classification.
We see different trends than for positional node embeddings. In this case, nonlinearity is not always helpful; indeed, Identity is on average much more competitive. However, on all datasets, using another proximity method or nonlinearity improves on GraphWave as originally proposed, highlighting the flexibility of PhUSION. We find that a very simple nonlinearity, binarization, produces the best methods on two datasets: as CFS models the distribution of entries in each row, embedding a binary distribution simply models how many large proximities a node has to other nodes. This corroborates a recent claim [10] that simple structural information suffices for these datasets.

Table 4: Real data (left): average rank and average/max accuracy of the different proximity functions Ψ() (PPMI, HK, PPR, FaBP, L+, A, R) and nonlinearity functions σ() (Identity, Log, Bin-5, Bin-50, Bin-95) on all datasets used for structural node embedding (Brazil, Europe, USA). Synthetic data (right): averaged clustering results (homogeneity, completeness, silhouette) for synthetic data with planted structural roles. For both tasks, we can dramatically improve on GraphWave by using a different proximity matrix and/or nonlinearity.

Figure 2: Node classification performance with structural embeddings. Many different proximity matrices and nonlinearity functions can yield high accuracy, often higher than the existing method GraphWave.
Node Clustering.
The results in Tab. 4 (right) show that a variety of proximity matrices successfully cluster nodes by their structural roles, in some cases better than the heat kernel used in GraphWave [2]. We show similar results on unperturbed graphs in the supplementary § C.
Observation 2.
Within our PhUSION framework, we discover design choices for structural embedding that improve on existing methods in downstream tasks. In particular, we discover that some design choices used for positional node embeddings, like nonlinearity, can improve structural embeddings as well.
Based on all our node-level experiments, we see that although the same design choices prior to embedding (Ψ(), σ()) can be used for positional or structural embeddings, in practice the best design choices for each kind of embedding tend to be different. For instance, nonlinearity is almost always helpful for positional node embeddings, but only sometimes helpful for structural embeddings. The proximity functions PPMI and L+ tend to be successful for positional node embeddings, but do not produce the best structural embeddings (seen most clearly on the clustering tasks).

This analysis raises an important question: can we characterize the node proximity matrices that produce good embeddings of either type? We perform an initial exploratory analysis in App. E, investigating properties of the matrices produced by each combination of Ψ() and σ(). We find that the row-wise sums of elements in matrices producing good positional node embeddings tend to have a bell-shaped distribution, whereas we observe power-law distributions in matrices that produce good structural embeddings.

Observation 3. While positional and structural node embeddings may begin with the same node proximity designs, in practice the best designs for each kind of embedding method tend to differ.

This may be one reason why the survey work [3], characterizing existing examples of positional and structural node embedding methods, judged their methodology to be fundamentally different (even though our framework and the theory of [4] show a methodological connection in principle).

Table 5: Graph classification accuracy using averaged node embeddings (Eq. (3.3)) from the proximity matrices PPMI, FaBP, HK, PPR, Adj, RW, and L+, compared against the baselines NetLSD and RetGK (gray) on IMDB-M, PROTEINS, and PTC-MR. We improve on NetLSD (3/3 datasets) and RetGK (2/3 datasets), which leverage simpler features from the HK and RW matrices, using our embeddings of these matrices. We may also use different proximity matrices like Adj, which can further increase performance.
6.2 Graph Classification. We now investigate PhUSION's effectiveness in learning graph features from various node proximity matrices. Intuitively, we expect that our more expressive features will allow us to classify graphs more accurately than previous works.
Setup.
Our experiments evaluate graph classification accuracy on the PTC-MR, IMDB-M, and PROTEINS datasets [22]. As our focus is learning from graph structure alone, we ignore node attributes. We only use CFS (i.e., structural embeddings), which are comparable across graphs [9], and do not use a nonlinearity σ(), as the baselines do not. We use a linear SVM to predict graphs' labels from their features; we report the 10-fold cross-validation accuracy averaged over 5 trials [9].

We compare against NetLSD [7] and RetGK [8], alternative ways of deriving graph features from the HK and RW proximity matrices, respectively (§ 5). We use NetLSD's default of 250 logarithmically spaced heat kernel scale values. We run RetGK using its defaults of 50th-order random walk return probabilities and its proposed exact and approximate successive kernel embeddings (κ and φ in § 5). We describe our hyperparameter settings in the supplementary App. B; we parallel the settings of NetLSD and RetGK, and carefully avoid giving ourselves any unfair advantage over them (if anything, they have a slight advantage: we leave NetLSD with its default higher dimensionality and RetGK with its default successive kernel embeddings).

Results.
In Tab. 5, we see that our methods generally improve on NetLSD and RetGK as ways of deriving graph features from their node proximity matrices. In particular, embedding RW using Eq. (3.2) outperforms RetGK, which is also based on the RW proximity matrix, on two of three datasets (PTC-MR and IMDB-M). Similarly, embedding HK outperforms NetLSD, which also uses the heat kernel matrix, on all three datasets (and outperforms all other methods on two datasets). This is strong evidence that by modeling each node's full distribution of proximities rather than its self-proximity alone, PhUSION captures more useful information.

Because we keep the embedding dimension the same as (or lower than) NetLSD and RetGK, which capture only a single value per node at each proximity scale (whereas we return a 10-dimensional embedding), we necessarily consider far fewer scales. Our good comparative performance indicates that modeling more graph information at fewer scales is generally superior to modeling less information at more scales.

Observation 4.
PhUSION gives us a way to learn graph features from a given node proximity matrix that yields greater accuracy than previous works [7, 8], likely because of its expressivity (§ 5).
For all our classification tasks, we also study the effect of proximity order for multiscale embeddings in the supplementary App. D. In general, we find that modeling strongly local information with low-order proximity yields good performance (and is computationally cheapest).
7 Conclusion

We have proposed the first unifying perspective that encompasses both proximity-preserving and structural node embedding methods, clarifying their contested technical relationship [4, 3]. This allows us to learn either kind of node embedding from any node proximity matrix that can be computed on a graph; such matrices arise throughout the field of graph mining. Our three-step framework PhUSION opens up a variety of design choices (we empirically study 35), encompassing existing methods and also producing novel ones. We provide insights into productive design choices for node-level graph mining using either kind of embedding. By aggregating a graph's embeddings, we can derive graph-level features from the node proximities; we show precisely
what information we can capture that is lost by other graph kernels and feature learning methods.

Within PhUSION there is still room to explore more design choices, such as other embedding functions (e.g., the nonlinear autoencoders used by a few methods for positional node embedding [17], or the trainable characteristic function sampling recently proposed for node and graph embedding [23]). For graph embedding, other designs use successive kernel embeddings and the incorporation of node attributes [8]. Furthermore, fast approximate computation of node proximities can allow PhUSION to scale to very large graphs [5, 2].
Acknowledgements
This work is supported by NSF Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, and Adobe, Amazon, Facebook, and Google faculty awards. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding parties.
References

[1] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 2018.
[2] Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In KDD, 2018.
[3] Ryan A. Rossi, Di Jin, Sungchul Kim, Nesreen K. Ahmed, Danai Koutra, and John Boaz Lee. On proximity and structural role-based embeddings in networks: Misconceptions, methods, and applications. TKDD, 2020.
[4] Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. In ICLR, 2020.
[5] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM, 2018.
[6] Sudhanshu Chanpuriya and Cameron Musco. InfiniteWalk: Deep network embeddings as Laplacian embeddings with a nonlinearity. In KDD, 2020.
[7] Anton Tsitsulin, Davide Mottin, Panagiotis Karras, Alexander Bronstein, and Emmanuel Müller. NetLSD: Hearing the shape of a graph. In KDD, 2018.
[8] Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai. RetGK: Graph kernels based on return probabilities of random walks. In NeurIPS, 2018.
[9] Mark Heimann, Tara Safavi, and Danai Koutra. Distribution of node embeddings as multiresolution features for graphs. In ICDM, 2019.
[10] Junchen Jin, Mark Heimann, Di Jin, and Danai Koutra. Understanding and evaluating structural node embeddings. In KDD MLG Workshop, 2020.
[11] Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. REGAL: Representation learning-based graph alignment. In CIKM, 2018.
[12] Cheng Yang, Maosong Sun, Zhiyuan Liu, and Cunchao Tu. Fast network embedding enhancement via high order proximity approximation. In IJCAI, 2017.
[13] Danai Koutra, Joshua T. Vogelstein, and Christos Faloutsos. DeltaCon: A principled massive-graph similarity function. In SDM, 2013.
[14] Danai Koutra, Tai-You Ke, U Kang, Duen Horng Polo Chau, Hsing-Kuo Kenneth Pao, and Christos Faloutsos. Unifying guilt-by-association approaches: Theorems and fast algorithms. In PKDD, 2011.
[15] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In KDD, 2016.
[16] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In CIKM, 2015.
[17] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In AAAI, 2016.
[18] Mark Heimann, Goran Murić, and Emilio Ferrara. Structural node embedding in signed social networks: Finding online misbehavior at multiple scales. In Complex Networks, 2020.
[19] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity. In KDD, 2017.
[20] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS, 2008.
[21] Giannis Nikolentzos and Michalis Vazirgiannis. Enhancing graph kernels via successive embeddings. In CIKM, 2018.
[22] Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML GRL Workshop, 2020.
[23] Benedek Rozemberczki and Rik Sarkar. Characteristic functions on graphs: Birds of a feather, from statistical descriptors to parametric models. In CIKM, 2020.
A Proofs

A.1 Existing Node Embedding Methods as Special Cases of Eq. (3.2). For all the methods in Theorem 4.1, we list the specific choices of node proximity Ψ(), nonlinearity σ(), and embedding ζ() functions (as well as whether or not they use multiscale proximity) that make them conform to our framework.

• GraphWave [2]: the node proximity Ψ() computes the graph's heat kernel matrix, the nonlinearity σ() is the identity function, and the embedding function ζ() is characteristic function sampling. The multiscale version of GraphWave is given by Eq. (3.2).
• NetMF [5]: the node proximity Ψ() computes the graph's PPMI matrix, the nonlinearity σ() is Log, and the embedding function ζ() is SVD.
• InfiniteWalk [6]: the node proximity Ψ() computes the PPMI matrix in the window size limit T = ∞, or the Laplacian pseudoinverse L+ as an approximation of this quantity up to a low-rank correction term. The nonlinearity σ() is Log (or, for the Laplacian pseudoinverse, the authors consider Bin-p), and the embedding function ζ() is SVD.
• HOPE [15]: the node proximity Ψ() computes the personalized PageRank matrix or the common neighbors matrix, the nonlinearity σ() is the identity function, and the embedding is SVD (possibly approximated for scalability [15]).
• GraRep [16]: the node proximity Ψ() is derived from powers of the adjacency matrix, the nonlinearity σ() is Log, and the embedding function is SVD; this method computes multiscale node embeddings by concatenating embeddings derived from different powers of the adjacency matrix.
• DNGR [17]: the node proximity Ψ() computes the graph's PPMI matrix (in a slightly different way than NetMF), the nonlinearity σ() is Log, and the nonlinear embedding function ζ() is implemented with a stacked denoising autoencoder.
• sRDE [18]: the node proximity Ψ() in a signed network is computed using a signed random walk with restart procedure, the nonlinearity σ() is the identity function, and the embedding function ζ() consists of computing a histogram (which is also permutation-invariant) of each node's signed proximity scores.

A.2 Embedding Functions that Produce Positional vs. Structural Node Embeddings.
Here we give the proof of Theorem 4.2:
Proof.
Part 1: SVD yields different embeddings for automorphic nodes.
Recall that finding the SVD of S̃ = [S̃_1, 0; 0, S̃_2] is equivalent to finding the eigendecomposition of S̃S̃^T: the singular vectors (columns of U) are the eigenvectors, and the singular values (diagonal entries of Σ) are the square roots of the eigenvalues of S̃S̃^T. Since the embeddings are formed from the first d columns of U and Σ, we equivalently analyze the eigendecomposition of S̃S̃^T.

1. S̃_1 and S̃_2 are similar matrices (S̃_2 = P S̃_1 P^T) and thus have the same eigenvalues, with eigenvectors related by the permutation P.

2. S̃S̃^T has the same eigenvalues as S̃_1 S̃_1^T (equivalently, S̃_2 S̃_2^T). First, all eigenvalues of S̃_1 S̃_1^T and S̃_2 S̃_2^T are eigenvalues of S̃S̃^T: if S̃_1 S̃_1^T v_λ = λ v_λ, then S̃S̃^T [v_λ, 0] = λ [v_λ, 0] and S̃S̃^T [0, v_λ] = λ [0, v_λ]. Conversely, all eigenvalues of S̃S̃^T are eigenvalues of S̃_1 S̃_1^T or S̃_2 S̃_2^T: we can write any eigenvector v of S̃S̃^T split in half as [v_1, v_2], such that S̃S̃^T [v_1, v_2] = λ [v_1, v_2]. Then S̃S̃^T [v_1, v_2] = [S̃_1 S̃_1^T, 0; 0, S̃_2 S̃_2^T][v_1, v_2] = [S̃_1 S̃_1^T v_1, S̃_2 S̃_2^T v_2]. Since [v_1, v_2] is an eigenvector of S̃S̃^T, [S̃_1 S̃_1^T v_1, S̃_2 S̃_2^T v_2] = λ [v_1, v_2], and thus S̃_1 S̃_1^T v_1 = λ v_1 and S̃_2 S̃_2^T v_2 = λ v_2, meaning that λ is also an eigenvalue of S̃_1 S̃_1^T or S̃_2 S̃_2^T (whichever of v_1, v_2 is nonzero).

3. Thus, each of the top singular vectors of S̃ that form the dimensions of Y (up to weighting by the singular values) has the form [0, v_λ] or [v_λ, 0]. (Since the graphs are connected, i.e., nonempty, v_λ ≠ 0.) That is, along any dimension, the nodes in one graph will have a nonzero embedding value while the nodes in the other graph will have a zero embedding value, so Y^(1)_i ≠ Y^(2)_{π(i)}.

This is of course an extreme case for a highly contrived example (perfectly automorphic nodes in perfectly disconnected components of a graph), but in general we can see (and the research community has found experimentally on real-world networks) that SVD embeddings encode positional rather than structural information, and nodes in very different parts of the graph will generally not be close in the embedding space.

Part 2: Permutation-invariant row functions such as CFS yield identical embeddings for automorphic nodes.
Let n be the number of nodes in either graph G_1 or G_2. Then the first n nodes in G correspond to G_1 and the second n nodes correspond to G_2. So for node i ∈ [1, ..., n], the ID of its counterpart under the isomorphism π is π(i) + n. Thus, we want to show that the rows of node i and node π(i) + n in S̃ are equivalent up to permutation. Formally, we show that for any i, j ∈ [1, ..., n], S̃_ij = S̃_{π(i)+n, π(j)+n}.

Let e_i be the i-th standard basis vector. Then S̃_ij = (S̃_1)_ij = e_i S̃_1 e_j^T = e_i P^T S̃_2 P e_j^T = e_{π(i)} S̃_2 e_{π(j)}^T = (S̃_2)_{π(i)π(j)} = S̃_{π(i)+n, π(j)+n}. This shows that any nonzero element in the i-th row of S̃ (which must occur among the first n elements) has a corresponding element among the second n elements of the (π(i)+n)-th row. Of course, the second n elements of the i-th row and the first n elements of the (π(i)+n)-th row of S̃ are zeros. Thus, these rows have the same elements and are identical up to permutation.

B Node Proximity Hyperparameters

For positional node embeddings: All embeddings have the standard 128 dimensions [5]. We tuned the hyperparameters of the node proximity functions
PPMI, PPR, HK, and FaBP on the Wikipedia dataset via grid search:

1. HK: we tried several scale values s and find the best performance from a scale below 1.
2. PPR: we tried several decay parameter values β and find the best performance from a decay below 1.
3. PPMI: we tried several window sizes T (including T = 2) and found little difference, so we use T = 10 with the approximate NetMF method [5].
4. FaBP: we tried several values for the parameters a and c. We found little difference across values of a, but a smaller c can lead to better performance, so we chose a = 1 and a small c.

For structural embeddings: On these smaller graphs, all embeddings are 50-dimensional. We tuned the hyperparameters of the node proximity functions
PPMI, PPR, HK, and FaBP on the USA dataset via grid search:

1. HK: we used multiscale embeddings following [2]. We found that on the airports datasets, their automatic scale selection procedure yielded unintuitively large and poorly performing scales (for example, applying the official implementation of GraphWave [2] with automatic scale selection on the USA-airports dataset gives scale parameters on the order of 10^6). We therefore tried three candidate sets of five scales each, and find the best performance from the set containing the smallest scales.
2. PPR: we tried several decay parameter values β and find that a decay below 1 works best.
3. PPMI: we tried several window sizes T and found that T = 10 achieves the best performance.
4. FaBP: we tried several values for the parameters a and c, but in the end we found that the heuristic proposed in [14] for setting a and c works best: a = 4h_h^2/(1 − 4h_h^2) and c = 2h_h/(1 − 4h_h^2), where the "about-half" homophily factor is h_h = sqrt((−c_1 + sqrt(c_1^2 + 4c_2))/(8c_2)), with c_1 = trace(D) + 2 and c_2 = trace(D^2) − 1.

For graph classification: For HK, we use five heat kernel scale values chosen to parallel NetLSD. For proximity functions computed by matrix powers (Adj and RW), we consider five values of the power k. At each of the five parameter settings, we learn 10-dimensional embeddings and use Eq. (3.2) to form a multiscale embedding with 50 dimensions (to match or stay below the modeling capacity of NetLSD and RetGK). Between NetLSD's higher (250) dimension and RetGK's successive kernel embeddings, our experimental setup gives NetLSD and RetGK each a small advantage.
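A minimal NumPy sketch of this "about-half" heuristic, following the formulas of [14] as reconstructed above (degrees is the vector of node degrees; the function name is ours):

```python
import numpy as np

def fabp_parameters(degrees):
    # c1 = trace(D) + 2, c2 = trace(D^2) - 1 for degree matrix D.
    c1 = degrees.sum() + 2
    c2 = (degrees ** 2).sum() - 1
    # "About-half" homophily factor h_h, then a and c as in [14].
    hh = np.sqrt((-c1 + np.sqrt(c1 ** 2 + 4 * c2)) / (8 * c2))
    a = 4 * hh ** 2 / (1 - 4 * hh ** 2)
    c = 2 * hh / (1 - 4 * hh ** 2)
    return a, c
```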
C Clustering Structural Node Embedding: Additional Details and Results

We use the synthetic graph generation pipeline provided by GraphWave [2]. The graphs are formed by placing 5 basic shapes of one of several types ("house", "fan", "star") [2] on a cycle of length 30. In the main paper, we add 10% random edges to perturb the otherwise perfect role equivalences of nodes in the same part of different shapes; in Tab. 6, however, we include clustering results on noiseless networks exhibiting perfect role equivalence. We use agglomerative clustering (with single linkage) to cluster the node embeddings learned by each method.
Table 6: Clustering results (homogeneity, completeness, and silhouette) on noiseless synthetic datasets for the proximity matrices PPMI, FaBP, HK, PPR, A, R, and L+. HK, used in GraphWave, is outperformed on all metrics by other proximity matrices.

D Proximity Order in Structural Embedding
The order of proximity that node embeddings model has been shown to be very important. While some methods by default model low-order (e.g., 2nd-order) proximities [10], other methods try to balance low-order and high-order proximities to capture local and global information. This has been done with multiscale embeddings, whether positional [16] or structural [2]. Setting the hyperparameters that govern the order of proximity is thus important to understand.
Setup.
For node and graph classification, we consider the effect of varying the order k for methods consisting of powers of a (filtered) similarity matrix S̃ (i.e., Adj or RW) when computing multiscale embeddings (Eq. (3.2)). We only consider up to 4th-order proximity for positional node embeddings due to the larger size of those datasets (which is also why we omit the largest dataset, BlogCatalog). For any value of k, we compute the embeddings from each power S̃, S̃^2, ..., S̃^k (using the Log nonlinearity for positional node embeddings and Identity for structural embeddings, the nonlinearity functions which performed well on average for each kind of embedding in § 6) and concatenate the resulting embeddings, as sketched below. Note that for the graph classification experiments, we now learn a 50-dimensional embedding at each scale, as we are no longer comparing to baseline methods.
Observation 5.
Modeling low-order node proximity(however, beyond first-order proximity, or direct edgeconnections alone) is generally sufficient for both kindsof embedding methods.
E Proximity Matrix Properties for EffectiveNode Embeddings
A powerful tool for the design of future node embed-ding methods would be an intrinsic characterizationof successful design choices for node embedding; thiscould allow for effective model selection without rely-ing on extrinsic evaluation (i.e. performance on down-stream tasks as in § 6). The node embedding step usu-ally leverages standard dimensionality reduction tech-niques; from a graph mining perspective, the most in-teresting part is the construction of (potentially nonlin-early transformed) node proximities. Thus, we seek tounderstand: how can we characterize choices Ψ( σ ( A )) that yield useful (positional or structural) node embed-dings? While effective intrinsic analysis of node embed-ding methods is a major open question, we present someinitial exploratory analysis to prompt further investiga-tion. E.1 Positional Embeddings
In a node proximitymatrix, the sums of each row correspond to the totalproximity scores each node has to all other nodes. Our (a) Proximity node classi-fication: Multiscale
Adj (b) Proximity node classi-fication: Multiscale RW (c) Structural node classi-fication: Multiscale Adj (d) Structural node classi-fication: Multiscale RW (e) Graph classification:Multiscale Adj (f) Graph classification:Multiscale RW Figure 3: Effect of proximity order on node andgraph classification. Low order proximities aresufficient to achieve good performance. intuition is that if we are to expect good positionalnode embeddings, most nodes should have a moderateamount of total proximity to other nodes—too low, andthe embedding objective will have too little similarityinformation to learn an effective embedding; too high,and the embedding objective will try to embed this nodeindiscriminately similarly to many other nodes.
Setup.
In Fig. 4, we visualize the distribution of row sums of all node proximity matrices arising from each combination of node proximity and nonlinearity function that we evaluated in this work.
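These row-wise diagnostics are straightforward to compute. A minimal NumPy sketch covering all three statistics used in this appendix (assuming each row has a positive sum so it can be normalized into a distribution; the function name is ours):

```python
import numpy as np

def row_statistics(S):
    # Row-wise sum, variance, and entropy of a proximity matrix:
    # the intrinsic diagnostics visualized in Figs. 4-5.
    sums = S.sum(axis=1)
    variances = S.var(axis=1)
    P = np.clip(S, 0.0, None)
    P = P / P.sum(axis=1, keepdims=True)    # normalize rows to distributions
    logP = np.log(np.where(P > 0, P, 1.0))  # log(1) = 0 masks zero entries
    entropies = -(P * logP).sum(axis=1)
    return sums, variances, entropies
```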
Results.
Some of these distributions exhibit a bell-curve shape with the values concentrated in the middle of the distribution, while others exhibit a power-law distribution with a single long tail. (Note that for the Bin-5 nonlinearity, the tail is on the left, as most values in the matrix are 1, so low row sums are the exception. In general, the tail consists of the large row sums, as is typical for most power-law distributions.)

Figure 4: Degree (row sum) distributions of matrices resulting from all different combinations of Ψ() and σ() on BlogCatalog. In general, some of the best-performing embedding methods (darker colors) come from matrices whose row sum distributions follow a bell curve rather than a power law.

Many successful design choices produce a bell-shaped distribution of row sums. For example, the Log nonlinearity filter (the best-performing nonlinearity on average) produces bell-shaped row sum distributions for all proximity matrices. Bin-95 produces a somewhat bell-shaped distribution for PPMI, one of the best-performing proximity matrices. In general, this lines up with our intuition to expect mostly moderate row sums.
Observation 6.
Some of the matrices yielding the best positional node embeddings have a bell-curve rather than a power-law distribution of row sums: that is, most nodes have moderate total proximity scores to all other nodes.
E.2 Structural Embeddings
The CFS embedding method treats each row of the proximity matrix as a probability distribution. When learning structural embeddings using CFS, GraphWave notes that two cases will be uninformative for structural embedding: if the distribution of row entries is either too uniform or if it has too few nonzero values.
Setup.
To measure the row-wise uniformity of the matrices, we consider the variance of each row. Meanwhile, we use entropy to diagnose rows with a few large entries and otherwise mostly small ones. Thus, we plot the row-wise distributions of variance and entropy for each proximity matrix, as well as the distribution of row sums as in § E.1. For brevity, we do not consider nonlinearity. Note that for PPR and L+, we truncate entropy values close to −∞ for ease of visualization.

Results. For most proximity matrices, the row-wise distribution of all three statistics tends to follow a power law. The row-wise sum and variance distributions for the PPMI matrix, which generally leads to some of the weaker structural embedding methods, tend to follow this pattern much more noisily. The proximity matrices PPR and FaBP, on the other hand, tend to follow an extreme power-law distribution with a very thin tail. This may indicate that a moderate power-law distribution is the most informative for structural embeddings. The contrast with proximity-preserving embeddings (§ E.1), where the most successful embeddings tended to come from matrices with a bell-curve distribution of row sums, corroborates our finding that the best positional and structural node embedding methods tend to use very different design choices.

Figure 5: Distributions of row statistics (degree, variance, and entropy) of the proximity matrices (without nonlinearity) on the USA Airports dataset used for structural embeddings. Methods leading to more accurate structural embeddings have darker color. Some of the best embeddings come from matrices whose rows' variance and entropy follow a power-law distribution.
Observation 7. Some of the matrices yielding the best structural node embeddings have row sums, variances, and entropies that follow a moderate power-law distribution, in contrast to the bell-shaped row sum distributions that characterize the best positional embeddings.