Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning
Elan Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija, Bryan Perozzi, Greg Ver Steeg, Aram Galstyan
Elan Markowitz*,1,2, Keshav Balasubramanian*,1, Mehrnoosh Mirtaheri*,1,2, Sami Abu-El-Haija*,1,2 (*Equal Contribution; 1 University of Southern California, 2 USC Information Sciences Institute)
Bryan Perozzi (Google Research)
Greg Ver Steeg, Aram Galstyan

ABSTRACT
Graph Representation Learning (GRL) methods have impacted fields from chemistry to social science. However, their algorithmic implementations are specialized to specific use-cases, e.g., message passing methods are run differently from node embedding ones. Despite their apparent differences, all these methods utilize the graph structure, and therefore, their learning can be approximated with stochastic graph traversals. We propose Graph Traversal via Tensor Functionals (GTTF), a unifying meta-algorithm framework for easing the implementation of diverse graph algorithms and enabling transparent and efficient scaling to large graphs. GTTF is founded upon a data structure (stored as a sparse tensor) and a stochastic graph traversal algorithm (described using tensor operations). The algorithm is a functional that accepts two functions, and can be specialized to obtain a variety of GRL models and objectives, simply by changing those two functions. We show that, for a wide class of methods, our algorithm learns in an unbiased fashion and, in expectation, approximates the learning as if the specialized implementations were run directly. With these capabilities, we scale otherwise non-scalable methods to set state-of-the-art on large graph datasets while being more efficient than existing GRL libraries, with only a handful of lines of code for each method specialization. GTTF and its various GRL implementations are on: https://github.com/isi-usc-edu/gttf
1 INTRODUCTION
Graph representation learning (GRL) has become an invaluable approach for a variety of tasks, such as node classification (e.g., in biological and citation networks; Veličković et al. (2018); Kipf & Welling (2017); Hamilton et al. (2017); Xu et al. (2018)), edge classification (e.g., link prediction for social and protein networks; Perozzi et al. (2014); Grover & Leskovec (2016)), entire graph classification (e.g., for chemistry and drug discovery; Gilmer et al. (2017); Chen et al. (2018a)), etc. In this work, we propose an algorithmic unification of various GRL methods that allows us to re-implement existing GRL methods and introduce new ones, in merely a handful of code lines per method. Our algorithm (abbreviated GTTF, Section 3.2) receives graphs as input, traverses them using efficient tensor operations, and invokes specializable functions during the traversal. We show function specializations for recovering popular GRL methods (Section 3.3). Moreover, since GTTF is stochastic, these specializations automatically scale to arbitrarily large graphs, without careful derivation per method. Importantly, such specializations, in expectation, recover unbiased gradient estimates of the objective w.r.t. model parameters. To disambiguate: by tensors, we refer to multi-dimensional arrays, as used in the Deep Learning literature; and by operations, we refer to routines such as matrix multiplication, advanced indexing, etc.
Figure 1: (a) Example graph G; (b) adjacency matrix for graph G; (c) CompactAdj for G, with sparse Â ∈ Z^{n×n} and dense δ ∈ Z^n (we store IDs of adjacent nodes in Â); (d) Walk Forest (GTTF invokes ACCUMULATEFN once per (green) instance). Panels (c) & (d) depict our data structure and traversal algorithm on the toy graph in (a) & (b).

GTTF uses a data structure Â (Compact Adjacency, Section 3.1): a sparse encoding of the adjacency matrix. Node v contains its neighbors in row Â[v] ≜ Â_v, notably, in the first degree(v) columns of Â[v]. This encoding allows stochastic graph traversals using standard tensor operations. GTTF is a functional, as it accepts functions ACCUMULATEFN and BIASFN, respectively, to be provided by each GRL specialization to accumulate necessary information for computing the objective, and optionally to parametrize the sampling procedure p(v's neighbors | v). The traversal internally constructs a walk forest as part of the computation graph. Figure 1 depicts the data structure and the computation. From a generalization perspective, GTTF shares similarities with Dropout (Srivastava et al., 2014).

Our contributions are: (i) a stochastic graph traversal algorithm (GTTF) based on tensor operations that inherits the benefits of vectorized computation and libraries such as PyTorch and TensorFlow; (ii) we list specialization functions, allowing GTTF to approximately recover the learning of a broad class of popular GRL methods; (iii) we prove that this learning is unbiased, with controllable variance; (iv) for this class of methods, we show that GTTF can scale previously-unscalable GRL algorithms, setting the state-of-the-art on a range of datasets. Finally, (v) we open-source GTTF along with new stochastic traversal versions of several algorithms, to aid practitioners from various fields in applying and designing state-of-the-art GRL methods for large graphs.

2 RELATED WORK
We take a broad standpoint in summarizing related work to motivate our contribution.
Method                              Family   Scales?   Learning
-- Models --
GCN, GAT                            MP       no        exact
node2vec                            NE       yes       approx
WYS                                 NE       no        exact
-- Stochastic Sampling Methods --
SAGE                                MP       yes       approx
FastGCN                             MP       yes       approx
LADIES                              MP       yes       approx
GraphSAINT                          MP       yes       approx
ClusterGCN                          MP       yes       heuristic
-- Software Frameworks --
PyG                                 Both     inherits / re-implements
DGL                                 Both     inherits / re-implements
-- Algorithmic Abstraction (ours) --
GTTF                                Both     yes       approx
Models for GRL have been proposed, including message passing (MP) algorithms, such as the Graph Convolutional Network (GCN) (Kipf & Welling, 2017) and Graph Attention (GAT) (Veličković et al., 2018); as well as node embedding (NE) algorithms, including node2vec (Grover & Leskovec, 2016) and WYS (Abu-El-Haija et al., 2018); among many others (Xu et al., 2018; Wu et al., 2019; Perozzi et al., 2014). The full-batch GCN of Kipf & Welling (2017), which drew recent attention and has motivated many MP algorithms, was not initially scalable to large graphs, as it processes all graph nodes at every training step. To scale MP methods to large graphs, researchers proposed Stochastic Sampling Methods that, at each training step, assemble a batch constituting subgraph(s) of the (large) input graph. Some of these sampling methods yield unbiased gradient estimates (with some variance), including SAGE (Hamilton et al., 2017), FastGCN (Chen et al., 2018b), LADIES (Zou et al., 2019), and GraphSAINT (Zeng et al., 2020). On the other hand, ClusterGCN (Chiang et al., 2019) is a heuristic in the sense that, despite its good performance, it provides no guarantee of unbiased gradient estimates of the full-batch learning. Gilmer et al. (2017) and Chami et al. (2021) generalized many GRL models into Message Passing and Auto-Encoder frameworks. These frameworks prompt bundling of GRL methods under
Software Libraries, like PyG (Fey & Lenssen, 2019) and DGL (Wang et al., 2019), offering consistent interfaces on data formats. We now position our contribution relative to the above. Unlike generalized message passing (Gilmer et al., 2017), rather than abstracting the model computation, we abstract the learning algorithm. As a result, GTTF can be specialized to recover the learning of MP as well as NE methods. Moreover, unlike Software Frameworks, which are re-implementations of many algorithms and therefore inherit the scale and learning of the copied algorithms, we re-write the algorithms themselves, giving them new properties (memory and computation complexity), while maintaining (in expectation) the original algorithm outcomes. Further, while the listed Stochastic Sampling Methods target MP algorithms (such as GCN, GAT, and alike), as their initial construction could not scale to large graphs, our learning algorithm applies to a wider class of GRL methods, additionally encapsulating NE methods. Finally, while some NE methods such as node2vec (Grover & Leskovec, 2016) and DeepWalk (Perozzi et al., 2014) are scalable in their original form, their scalability stems from their multi-step process: sample many (short) random walks, save them to disk, and then learn node embeddings using positional embedding methods (e.g., word2vec, Mikolov et al. (2013)). They are sub-optimal in the sense that their first step (walk sampling) takes considerable time (before training even starts) and also places an artificial limit on the number of training samples (number of simulated walks), whereas our algorithm conducts walks on-the-fly whilst training.
3 GRAPH TRAVERSAL VIA TENSOR FUNCTIONALS (GTTF)
At its core, GTTF is a stochastic algorithm that recursively conducts graph traversals to build representations of the graph. We describe the data structure and traversal algorithm below, using the following notation. G = (V, E) is an unweighted graph with n = |V| nodes and m = |E| edges, described as a sparse adjacency matrix A ∈ {0, 1}^{n×n}. Without loss of generality, let the nodes be zero-based numbered, i.e. V = {0, . . . , n − 1}. We denote the out-degree vector δ ∈ Z^n; it can be calculated by summing over rows of A as δ_u = Σ_{v∈V} A[u, v]. We assume δ_u > 0 for all u ∈ V: pre-processing can add self-connections to orphan nodes. B denotes a batch of nodes.

3.1 DATA STRUCTURE
Internally, GTTF relies on a reformulation of the adjacency matrix, which we term
CompactAdj (for "Compact Adjacency", Figure 1c). It consists of two tensors:

1. δ ∈ Z^n, a dense out-degree vector (Figure 1c, right).
2. Â ∈ Z^{n×n}, a sparse edge-list matrix in which row u contains left-aligned δ_u non-zero values. The consecutive entries {Â[u, 0], Â[u, 1], . . . , Â[u, δ_u − 1]} contain the IDs of nodes receiving an edge from node u. The remaining |V| − δ_u entries are left unset; therefore, Â only occupies O(m) memory when stored as a sparse matrix (Figure 1c, left).

CompactAdj allows us to concisely describe stochastic traversals using standard tensor operations. To uniformly sample a neighbor of node u ∈ V, one can draw r ∼ U[0 .. (δ_u − 1)], then get the neighbor ID with Â[u, r]. In vectorized form, given node batch B and access to continuous U[0, 1), we sample neighbors for each node in B as: R ∼ U[0, 1)^b, where b = |B|; then B′ = Â[B, ⌊R ◦ δ[B]⌋] is a b-sized vector, with B′_u containing a neighbor of B_u, the floor operation ⌊.⌋ applied element-wise, and ◦ the Hadamard product.
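To make the layout concrete, here is a minimal NumPy sketch (not the released GTTF code) that builds CompactAdj from a directed edge list and performs the vectorized uniform neighbor sampling described above; the function names (build_compact_adj, sample_neighbors) and the dense storage of Â are illustrative simplifications.

```python
import numpy as np

def build_compact_adj(edges, n):
    """Builds (compact_adj, degrees) from a list of directed edges (u, v).

    compact_adj[u, :degrees[u]] holds the IDs of u's out-neighbors, left-aligned;
    remaining entries are unused (dense zeros here, a sparse matrix in practice).
    """
    degrees = np.zeros(n, dtype=np.int64)
    for u, _ in edges:
        degrees[u] += 1
    compact_adj = np.zeros((n, int(degrees.max())), dtype=np.int64)
    cursor = np.zeros(n, dtype=np.int64)
    for u, v in edges:
        compact_adj[u, cursor[u]] = v
        cursor[u] += 1
    return compact_adj, degrees

def sample_neighbors(compact_adj, degrees, batch, rng):
    """Vectorized uniform sampling: B' = Ahat[B, floor(R * delta[B])]."""
    r = rng.random(len(batch))                            # R ~ U[0, 1)^b
    cols = np.floor(r * degrees[batch]).astype(np.int64)  # one column index per batch node
    return compact_adj[batch, cols]

# A small toy directed graph (not necessarily the one in Figure 1).
edges = [(0, 1), (1, 0), (1, 2), (1, 3), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
A_hat, delta = build_compact_adj(edges, n=5)
rng = np.random.default_rng(0)
print(sample_neighbors(A_hat, delta, np.array([1, 2, 3]), rng))
```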
3.2 STOCHASTIC TRAVERSAL FUNCTIONAL ALGORITHM

Our traversal algorithm starts from a batch of nodes. It expands from each into a tree, resulting in a walk forest rooted at the nodes in the batch, as depicted in Figure 1d. In particular, given a node batch B, the algorithm instantiates |B| seed walkers, placing one at every node in B. Iteratively, each walker first replicates itself a fanout (f) number of times. Each replica then samples and transitions to a neighbor. This process repeats a depth (h) number of times. Therefore, each seed walker becomes the ancestor of an f-ary tree with height h. Setting f = 1 recovers the traditional random walk. In practice, we provide flexibility by allowing a custom fanout value per depth.
Algorithm 1: Stochastic Traverse Functional, parametrized by ACCUMULATEFN and BIASFN.

input: u (current node); T ← [] (path leading to u, starts empty); F (list of fanouts);
       ACCUMULATEFN (function with side-effects and no return; it is model-specific and records information for computing the model and/or objective, see text);
       BIASFN ← U (function mapping u to a distribution on u's neighbors, defaults to uniform)

def Traverse(T, u, F, ACCUMULATEFN, BIASFN):
    if F.size() = 0 then return
    f ← F.pop()
    sample_bias ← BIASFN(T, u)
    if sample_bias.sum() = 0 then return
    sample_bias ← sample_bias / sample_bias.sum()
    K ← Sample(Â[u, :δ_u], sample_bias, f)        ▷ samples f nodes from u's neighbors
    for k ← 0 to f − 1 do
        T_next ← concatenate(T, [u])
        ACCUMULATEFN(T_next, K[k], f)
        Traverse(T_next, K[k], F, ACCUMULATEFN, BIASFN)

def Sample(N, W, f):
    C ← tf.cumsum(W)
    coin_flips ← tf.random.uniform((f,), 0, 1)
    indices ← tf.searchsorted(C, coin_flips)
    return N[indices]
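The following is a small NumPy transcription of Algorithm 1, assuming the CompactAdj arrays from the earlier sketch; it processes one seed node at a time for readability, whereas the released implementation is vectorized over whole batches, and the names (traverse, sample, uniform_bias_factory) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(neighbors, weights, f):
    """Draws f neighbors (with replacement) proportionally to `weights`."""
    c = np.cumsum(weights)                    # mirrors tf.cumsum
    coin_flips = rng.random(f)                # mirrors tf.random.uniform
    indices = np.searchsorted(c, coin_flips)  # mirrors tf.searchsorted
    return neighbors[indices]

def traverse(T, u, fanouts, accumulate_fn, bias_fn, compact_adj, degrees):
    """Stochastic Traverse Functional (Algorithm 1), one seed node at a time."""
    if not fanouts:
        return
    f, rest = fanouts[0], fanouts[1:]
    neighbors = compact_adj[u, :degrees[u]]
    bias = bias_fn(T, u)
    if bias.sum() == 0:
        return
    bias = bias / bias.sum()
    K = sample(neighbors, bias, f)
    for k in range(f):
        T_next = T + [u]
        accumulate_fn(T_next, K[k], f)
        traverse(T_next, K[k], rest, accumulate_fn, bias_fn, compact_adj, degrees)

def uniform_bias_factory(degrees):
    """Default BiasFn: uniform distribution over the current node's neighbors."""
    return lambda T, u: np.ones(degrees[u]) / degrees[u]
```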
Functional Traverse is listed in Algorithm 1. It accepts: a batch of nodes; a list of fanout values F (e.g., F = [3, 5] samples 3 neighbors per u ∈ B, then 5 neighbors for each of those); and, more notably, two functions: ACCUMULATEFN and BIASFN. These functions will be called by the functional on every node visited along the traversal, and will be passed relevant information (e.g., the path taken from the root seed node). Custom settings of these functions allow recovering wide classes of graph learning methods. At a high level, our functional can be used in the following manner:

1. Construct model & initialize parameters (e.g. to random). Define ACCUMULATEFN and BIASFN.
2. Repeat (many rounds):
   i. Reset accumulation information (from previous round) and then sample batch B ⊂ V.
   ii. Invoke Traverse on (B, ACCUMULATEFN, BIASFN), which invokes the FN's, allowing the first to accumulate information sufficient for running the model and estimating an objective.
   iii. Use accumulated information to: run model, estimate objective, apply learning rule (e.g. SGD).

(A minimal code sketch of this loop is given at the end of this subsection.)

ACCUMULATEFN is a function that is used to track necessary information for computing the model and the objective function. For instance, an implementation of DeepWalk (Perozzi et al., 2014) on top of GTTF specializes ACCUMULATEFN to measure an estimate of the sampled softmax likelihood of nodes' positional distribution, modeled as a dot-product of node embeddings. On the other hand, GCN (Kipf & Welling, 2017) on top of GTTF uses it to accumulate a sampled adjacency matrix, which it passes to the underlying model (e.g. 2-layer GCN) as if this were the full adjacency matrix.

BIASFN is a function that customizes the sampling procedure for the stochastic transitions. If provided, it must yield a probability distribution over nodes, given the current node and the path that led to it. If not provided, it defaults to U, transitioning to any neighbor with equal probability. It can be defined to read edge weights, if they denote importance, or, more intricately, used to parameterize a second-order Markov Chain (Grover & Leskovec, 2016), or to use neighborhood attention to guide sampling (Veličković et al., 2018), as discussed in the Appendix.

(Footnote: Our pseudo-code displays the traversal starting from one node rather than a batch only for clarity, as our actual implementation is vectorized: u would be a vector of nodes, T would be a 2D matrix with each row containing the transition path preceding the corresponding entry in u, etc. Refer to the Appendix and code.)
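As an illustration of the usage pattern above (a sketch under the assumption that build_compact_adj, traverse, and uniform_bias_factory from the earlier sketches are available), the following loop accumulates a rooted sub-adjacency per batch; the model/optimizer step is left as a placeholder.

```python
import numpy as np

# Assumes build_compact_adj / traverse / uniform_bias_factory from the sketches above.

def run_training_rounds(compact_adj, degrees, num_rounds=3, batch_size=2, fanouts=(3, 3)):
    n = degrees.shape[0]
    rng = np.random.default_rng(1)
    uniform_bias = uniform_bias_factory(degrees)
    for _ in range(num_rounds):
        # (i) reset accumulation, then sample a batch B of seed nodes
        sampled_edges = set()
        def rooted_adj_acc(T, v, f):
            # records the walk-forest edge (child v, parent T[-1])
            sampled_edges.add((int(v), int(T[-1])))
        batch = rng.choice(n, size=batch_size, replace=False)
        # (ii) invoke Traverse from every seed node in the batch
        for u in batch:
            traverse([], u, list(fanouts), rooted_adj_acc, uniform_bias, compact_adj, degrees)
        # (iii) run model on accumulated info, estimate objective, apply SGD step
        #       (placeholder: just report the size of the sampled sub-adjacency)
        print(f"batch {batch} -> {len(sampled_edges)} sampled edges")
```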
3.3 SOME SPECIALIZATIONS OF ACCUMULATEFN & BIASFN

3.3.1 MESSAGE PASSING: GRAPH CONVOLUTIONAL VARIANTS
These methods, including (Kipf & Welling, 2017; Hamilton et al., 2017; Wu et al., 2019; Abu-El-Haija et al., 2019; Xu et al., 2018), can be approximated by initializing Ã to an empty sparse n × n matrix, then invoking Traverse (Algorithm 1) with u = B and F set to a list of fanouts of size h. ACCUMULATEFN and BIASFN become:

def ROOTEDADJACC(T, u, f): Ã[u, T_{−1}] ← 1;    (1)

def NOREVISITBIAS(T, u): return 𝟙[Ã[u].sum() = 0] · (1_{δ_u} / δ_u);    (2)

where 1_n denotes an n-dimensional all-ones vector, and negative indexing T_{−k} is the k-th last entry of T. If a node has been visited through the stochastic traversal, then it already has a fanout number of neighbors, and NOREVISITBIAS ensures it does not get revisited, for efficiency, per line 5 of Algorithm 1. Afterwards, the accumulated stochastic Ã will be fed into the underlying model, e.g. for a 2-layer GCN of Kipf & Welling (2017):

GCN(Ã, X; W_0, W_1) = softmax(Å ReLU(Å X W_0) W_1);    (3)

with Å = D′^{1/2} D̃′^{−1} Ã′ D′^{−1/2};  D′ = diag(δ′);  δ′ = 1_n^⊤ A′;  and the renorm trick Ã′ = I_{n×n} + Ã.

Lastly, h should be set to the receptive field required by the model for obtaining output d_L-dimensional features at the labeled node batch, in particular, to the number of GC layers multiplied by the number of hops each layer accesses, e.g. hops = 1 for GCN, but customizable for MixHop and SimpleGCN. (Footnote: before feeding the batch to the model, in practice, we find nodes not reached by the traversal and remove their corresponding rows (and also columns) from X (and Ã).)
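For illustration, the following NumPy sketch runs a 2-layer GCN forward pass in the spirit of Eq. (3) on an accumulated (dense, toy-sized) adjacency; note that it applies the standard symmetric renormalization D′^{−1/2}(I + Ã)D′^{−1/2} rather than Eq. (3)'s exact mixed normalization, and all names and shapes are illustrative.

```python
import numpy as np

def renormalized_adj(adj_tilde):
    """Symmetric 'renorm trick': D'^(-1/2) (I + adj) D'^(-1/2) (a simplification of Eq. 3)."""
    a_prime = adj_tilde + np.eye(adj_tilde.shape[0])
    d_prime = a_prime.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d_prime)
    return (a_prime * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def gcn_two_layer(adj_tilde, X, W0, W1):
    """softmax(A_norm ReLU(A_norm X W0) W1), with a row-wise softmax over classes."""
    a_norm = renormalized_adj(adj_tilde)
    h = np.maximum(a_norm @ X @ W0, 0.0)
    logits = a_norm @ h @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Example: random features/weights on a 4-node symmetric sampled sub-adjacency.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
probs = gcn_two_layer(adj, rng.random((4, 8)), rng.random((8, 16)), rng.random((16, 3)))
print(probs.shape)  # (4, 3): class probabilities per node
```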
3.3.2 NODE EMBEDDINGS

Given a batch of nodes B ⊆ V, DeepWalk can be implemented in GTTF by first initializing the loss L to the contrastive term estimating the partition function of the log-softmax:

L ← Σ_{u∈B} log E_{v∼P_n(V)} [ exp(⟨Z_u, Z_v⟩) ],    (4)

where ⟨., .⟩ is dot-product notation and Z ∈ R^{n×d} is the trainable embedding matrix, with Z_u ∈ R^d the d-dimensional embedding of node u ∈ V. In our experiments, we estimate the expectation by taking 5 samples and we set the negative distribution P_n(V = v) ∝ δ_v, following Mikolov et al. (2013). The functional is invoked with no BIASFN and with ACCUMULATEFN =

def DEEPWALKACC(T, u, f): L ← L − ⟨ Z_u, Σ_{k=1}^{C_T} η[T_{−k}] ((C − k + 1)/C) Z_{T_{−k}} ⟩;  η[u] ← η[T_{−1}]/f;    (5)

where hyperparameter C indicates the maximum window size (inherited from word2vec, Mikolov et al., 2013); the summation over k does not access invalid entries of T, as C_T ≜ min(C, T.size); the scalar fraction (C − k + 1)/C is inherited from context sampling of word2vec (Section 3.1 in Levy et al., 2015) and re-derived for graph context by Abu-El-Haija et al. (2018); and η[u] stores a scalar per node of the traversal Walk Forest, which defaults to 1 for non-initialized entries and is used as a correction term. DeepWalk conducts random walks (visualized as a straight line) whereas our walk tree has a branching factor of f. Setting fanout f = 1 recovers DeepWalk's simulation, though we found f > 1 outperforms within fewer iterations, e.g. f = 5, within 1 epoch, outperforms DeepWalk's published implementation. Learning can be performed using the accumulated L as: Z ← Z − ε ∇_Z L. (We present more methods in the Appendix.)
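A simplified NumPy sketch of this accumulation is below; it keeps only the contrastive term of Eq. (4) and the window-weighted positive term of Eq. (5), omitting the η correction, so it is an illustrative approximation rather than a transcription of the released code (names such as deepwalk_batch_loss are hypothetical).

```python
import numpy as np

def deepwalk_batch_loss(Z, pos_pairs, batch, degrees, rng, num_neg=5, C=5):
    """Sampled-softmax style DeepWalk loss over one walk-forest batch.

    pos_pairs: list of (u, v, k) with v appearing k hops after u on a walk (k <= C).
    """
    # Contrastive term of Eq. (4): log E_{v ~ P_n}[exp(<Z_u, Z_v>)], with P_n(v) ∝ degree(v).
    p_neg = degrees / degrees.sum()
    loss = 0.0
    for u in batch:
        neg = rng.choice(len(degrees), size=num_neg, p=p_neg)
        loss += np.log(np.mean(np.exp(Z[neg] @ Z[u])))
    # Positive term of Eq. (5): subtract window-weighted dot products along walks.
    for u, v, k in pos_pairs:
        loss -= ((C - k + 1) / C) * (Z[u] @ Z[v])
    return loss
```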
4 THEORETICAL ANALYSIS

Due to space limitations, we include the full proofs of all propositions in the Appendix.
4.1 ESTIMATING THE k-TH POWER OF THE TRANSITION MATRIX
We show that it is possible with GTTF to accumulate an estimate of the transition matrix T raised to power k. Let Ω denote the walk forest generated by GTTF, Ω(u, k, i) the i-th node in the vector of nodes at depth k of the walk tree rooted at u ∈ B, and t^{u,v,k}_i the indicator random variable [Ω(u, k, i) = v]. Let the estimate of the k-th power of the transition matrix be denoted T̂^k. Entry T̂^k_{u,v} should be an unbiased estimate of T^k_{u,v} for u ∈ B, with controllable variance. We define:

T̂^k_{u,v} = ( Σ_{i=1}^{f^k} t^{u,v,k}_i ) / f^k    (6)

The fraction in Equation 6 counts the number of times the walker starting at u visits v in Ω, divided by the total number of nodes visited at the k-th step from u.

Proposition 1. (UNBIASED T^k) T̂^k_{u,v}, as defined in Equation 6, is an unbiased estimator of T^k_{u,v}.

Proposition 2. (VARIANCE T^k) The variance of our estimate is upper-bounded: Var[T̂^k_{u,v}] ≤ 1/(4 f^k).

Exact k-th powers of the transition matrix can be computed via repeated sparse matrix-vector multiplication. Specifically, each column of T^k can be computed in O(mk), where m is the number of edges in the graph; thus, computing T^k in its entirety can be accomplished in O(nmk). However, this can still become prohibitively expensive if the graph grows beyond a certain size. GTTF, on the other hand, can estimate T^k in time complexity independent of the size of the graph (Prop. 8), with low variance. Transition matrix powers are useful for many GRL methods (Qiu et al., 2018).
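A minimal NumPy sketch of this estimator follows; it flattens the walk forest into f^k independent length-k uniform walks, which matches the estimator of Eq. (6) entry-wise in expectation though not the forest's shared-prefix structure, and the names are illustrative.

```python
import numpy as np

def estimate_transition_power_row(compact_adj, degrees, u, k, f, rng):
    """Monte-Carlo estimate of row u of T^k from f^k sampled length-k walks (cf. Eq. 6)."""
    n = degrees.shape[0]
    frontier = np.full(f ** k, u, dtype=np.int64)      # f^k walkers all start at u
    for _ in range(k):
        cols = np.floor(rng.random(frontier.shape[0]) * degrees[frontier]).astype(np.int64)
        frontier = compact_adj[frontier, cols]         # one uniform transition per walker
    counts = np.bincount(frontier, minlength=n)
    return counts / float(f ** k)                      # estimated T^k[u, :], sums to 1
```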
4.2 UNBIASED LEARNING

As a consequence of Propositions 1 and 2, GTTF enables unbiased learning with variance control for classes of node embedding methods, and provides a convergence guarantee for graph convolution models under certain simplifying assumptions. We start by analyzing node embedding methods. Specifically, we cover two general types: the first is based on matrix factorization of the power-series of the transition matrix, and the second is based on cross-entropy objectives, e.g., DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016). These two are shown in Propositions 3 and 4.
Proposition 3. (UNBIASED T FACTORIZATION) Suppose 𝓛 = ‖LR − Σ_k c_k T^k‖²_F, i.e. a factorization objective that can be optimized by gradient descent by calculating ∇_{L,R} 𝓛, where the c_k are scalar coefficients. Let its estimate be 𝓛̂ = ‖LR − Σ_k c_k T̂^k‖²_F, where T̂ is obtained by GTTF according to Equation 6. Then E[∇_{L,R} 𝓛̂] = ∇_{L,R} 𝓛.

Proposition 4. (UNBIASED LEARN NE) Consider learning node embeddings Z ∈ R^{n×d} with objective function L, decomposable as L(Z) = Σ_{u∈V} L₁(Z, u) − Σ_{u,v∈V} Σ_k L₂(T^k, u, v) L₃(Z, u, v), where L₂ is linear over T^k. Then using T̂^k yields an unbiased estimate of ∇_Z L.

Generally, L₁ (and L₃) score the similarity between disconnected (and connected) nodes u and v. The above form of L covers a family of contrastive learning objectives that use cross-entropy loss and assume a logistic or (sampled-)softmax distribution. We provide, in the Appendix, the decompositions for the objectives of DeepWalk (Perozzi et al., 2014), node2vec (Grover & Leskovec, 2016) and WYS (Abu-El-Haija et al., 2018).
Proposition 5. (UNBIASED MP) Given input activations H^(l−1), graph conv layer (l) can use the rooted adjacency Ã accumulated by ROOTEDADJACC (1) to provide an unbiased pre-activation output, i.e. E[Å^k H^(l−1) W^(l)] = (D′^{−1/2} A′ D′^{−1/2})^k H^(l−1) W^(l), with A′ and D′ defined in (3).
Proposition 6. (UNBIASED LEARN MP) If the objective of a graph convolution model is convex and Lipschitz continuous, with minimizer θ*, then utilizing GTTF for graph convolution converges to θ*.

4.3 COMPLEXITY ANALYSIS
Proposition 7. (STORAGE) The storage complexity of GTTF is O(m + n).

Proposition 8. (TIME) The time complexity of GTTF is O(b f^h) for batch size b, fanout f, and depth h.

Proposition 8 implies that the speed of computation is irrespective of graph size. Methods implemented in GTTF inherit this advantage. For instance, the node embedding algorithm WYS (Abu-El-Haija et al., 2018) is O(n²); however, we apply its GTTF implementation on large graphs.

5 EXPERIMENTS
We conduct experiments on 10 different graph datasets, listed in Table 1. We experimentally demonstrate the following. (1) Re-implementing a baseline method using GTTF maintains its performance. (2) Previously-unscalable methods can be made scalable when implemented in GTTF. (3) GTTF achieves good empirical performance when compared to other sampling-based approaches hand-designed for Message Passing. (4) GTTF consumes less memory and trains faster than other popular Software Frameworks for GRL. To replicate our experimental results, for each cell of the tables, our code repository provides one shell script to produce the metric, except when we indicate that the metric is copied from another paper. Unless otherwise stated, we used a fanout factor of 3 for GTTF implementations. Learning rates and model hyperparameters are included in the Appendix.
5.1 NODE EMBEDDINGS FOR LINK PREDICTION
In link prediction tasks, a graph is partially obstructed by hiding a portion of its edges. The task is to recover the hidden edges. We follow a popular approach to tackle this task: first learn node embeddings Z ∈ R^{n×d} from the observed graph, then predict the link between nodes u and v with score ∝ Z_u^⊤ Z_v. We use two ranking metrics for evaluation: ROC-AUC, which measures how well methods rank the hidden edges above randomly-sampled negative edges, and Mean Rank. We re-implement the node embedding methods DeepWalk (Perozzi et al., 2014) and WYS (Abu-El-Haija et al., 2018) in GTTF (abbreviated F). Table 2 summarizes link prediction test performance. LiveJournal and Reddit are large datasets, to which the original implementation of WYS is unable to scale. However, the scalable F(WYS) sets a new state-of-the-art on these datasets. For the PPI and HepTh datasets, we copy accuracy numbers for DeepWalk and WYS from (Abu-El-Haija et al., 2018). For LiveJournal, we copy accuracy numbers for DeepWalk and PBG from (Lerer et al., 2019); note that a well-engineered approach (PBG; Lerer et al., 2019), using a mapreduce-like framework, under-performs compared to F(WYS), which is a few-lines specialization of GTTF.
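For reference, a brief sketch of this evaluation protocol (scoring held-out edges against uniformly sampled negative pairs with embedding dot products, then computing ROC-AUC via scikit-learn's roc_auc_score); the setup and names are illustrative, and negatives here are sampled uniformly rather than by any scheme specified in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(Z, hidden_edges, num_neg_per_edge, rng):
    """ROC-AUC of dot-product scores: hidden (positive) edges vs. random negatives."""
    n = Z.shape[0]
    pos_scores = [Z[u] @ Z[v] for u, v in hidden_edges]
    neg_scores = []
    for _ in range(len(hidden_edges) * num_neg_per_edge):
        u, v = rng.integers(n), rng.integers(n)
        neg_scores.append(Z[u] @ Z[v])
    labels = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))])
    scores = np.concatenate([pos_scores, neg_scores])
    return roc_auc_score(labels, scores)
```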
5.2 MESSAGE PASSING FOR NODE CLASSIFICATION

We implement in GTTF the message passing models GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), MixHop (Abu-El-Haija et al., 2019), and SimpleGCN (Wu et al., 2019), as their computation is straight-forward. For GAT (Veličković et al., 2018) and GCNII (Chen et al., 2020), which are more intricate, we download the authors' codes and wrap them as-is with our functional. We show that we are able to run these models in Table 3 (left and middle), and that the GTTF implementations match the baselines' performance. For the left table, we copy numbers from the published papers. However, we update GAT to work with TensorFlow 2.0 and we use our updated code (GAT*).
5.3 EXPERIMENTS COMPARING AGAINST SAMPLING METHODS FOR MESSAGE PASSING
We now compare models trained with GTTF (where samples are walk forests) against sampling methods that are especially designed for Message Passing algorithms (GraphSAINT and ClusterGCN), particularly since their sampling strategies do not match ours. Table 3 (right) shows node classification test accuracy on a large dataset: Products. We calculate the accuracy for F(SAGE), but copy from (Hu et al., 2020) the accuracy for the baselines: GraphSAINT (Zeng et al., 2020) and ClusterGCN (Chiang et al., 2019) (both message passing methods); and also node2vec (Grover & Leskovec, 2016) (a node embedding method).

Table 1: Dataset summary. Tasks are LP, SSC, FSC, for link prediction, semi- and fully-supervised classification. Split indicates the train/validate/test partitioning, with (a) = (Abu-El-Haija et al., 2018), (b) = to be released, (c) = (Hamilton et al., 2017), (d) = (Yang et al., 2016), (e) = (Hu et al., 2020).

Dataset       Split   |V|       |E|       Classes   Nodes        Edges            Task
PPI           (a)     3,852     20,881    N/A       proteins     interaction      LP
ca-HepTh      (a)     80,638    24,827    N/A       researchers  co-authorship    LP
ca-AstroPh    (a)     17,903    197,031   N/A       researchers  co-authorship    LP
LiveJournal   (b)     4.85M     68.99M    N/A       users        friendship       LP
Reddit        (c)     233,965   11.60M    41        posts        user co-comment  LP/FSC
Amazon        (b)     2.6M      48.33M    31        products     co-purchased     FSC
Cora          (d)     2,708     5,429     7         articles     citation         SSC
CiteSeer      (d)     3,327     4,732     6         articles     citation         SSC
PubMed        (d)     19,717    44,338    3         articles     citation         SSC
Products      (e)     2.45M     61.86M    47        products     co-purchased     SSC
Table 2: Results of node embeddings on Link Prediction. Left: test ROC-AUC scores on PPI, HepTh, and Reddit, for DeepWalk, F(DeepWalk), WYS, and F(WYS); the original WYS runs out of memory (OOM) on Reddit. Right: Mean Rank on LiveJournal, for consistency with Lerer et al. (2019), for DeepWalk, PBG, WYS (OOM*), and F(WYS). *OOM = Out of Memory.
Table 3: Node classification tasks. Left: test accuracy scores on semi-supervised classification (SSC) of citation networks (Cora, Citeseer, Pubmed) for GCN, F(GCN), MixHop, F(MixHop), GAT*, F(GAT), GCNII, and F(GCNII). Middle: test micro-F1 scores for large fully-supervised classification (Reddit, Amazon) for SAGE, F(SAGE), SimpGCN, and F(SimpGCN). Right: test accuracy on an SSC task (Products), showing only scalable baselines: node2vec, ClusterGCN, GraphSAINT, and F(SAGE). We bold the highest value per column.

Table 4: Left: Speed is the per-epoch time in seconds when training GraphSAGE; Memory is the memory in GB used when training GCN. All experiments were conducted using an AMD Ryzen 3 1200 Quad-Core CPU and an Nvidia GTX 1080Ti GPU. Right: training curve (link-prediction ROC-AUC vs. time) for GTTF and PyG implementations of node2vec.
          Speed (s)              Memory (GB)
          Reddit    Products     Reddit   Cora   Citeseer   Pubmed
DGL       17.3      13.4         OOM      1.1    1.1        1.1
PyG       5.8       9.2          OOM      1.2    1.3        1.6
GTTF

5.4 RUNTIME AND MEMORY COMPARISON AGAINST OPTIMIZED SOFTWARE FRAMEWORKS
In addition to the accuracy metrics discussed above, we also care about computational performance. We compare against the software frameworks DGL (Wang et al., 2019) and PyG (Fey & Lenssen, 2019). These software frameworks offer implementations of many methods. Table 4 summarizes the following. First (left), we show time-per-epoch on large graphs of their implementation of GraphSAGE, compared with GTTF's, where we set all hyperparameters to be the same (model architecture, and number of neighbors at message passing layers). Second (middle), we run their GCN implementation on small datasets (Cora, Citeseer, Pubmed) to show peak memory usage. The run times between GTTF, PyG and DGL are similar for these datasets; the comparison can be found in the Appendix. While the aforementioned two comparisons are on popular message passing methods, the third (right) chart shows a popular node embedding method: node2vec's link prediction test ROC-AUC in relation to its training runtime.
6 CONCLUSION
We present a new algorithm, Graph Traversal via Tensor Functionals (GTTF), that can be specialized to re-implement the algorithms of various Graph Representation Learning methods. The specialization takes little effort per method, making it straight-forward to port existing methods or introduce new ones. Methods implemented in GTTF run efficiently, as GTTF uses tensor operations to traverse graphs. In addition, the traversal is stochastic and therefore automatically makes the implementations scalable to large graphs. We theoretically show that, for the popular GRL methods we analyze, the learning outcome due to the stochastic traversal is in expectation equivalent to the baseline where the graph is observed at once. Our thorough experimental evaluation confirms that methods implemented in GTTF maintain their empirical performance, and can be trained faster and using less memory even compared to software frameworks that have been thoroughly optimized.
ACKNOWLEDGEMENTS

We acknowledge support from the Defense Advanced Research Projects Agency (DARPA) under award FA8750-17-C-0106.

REFERENCES
Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, 2018.

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, 2019.

Pierre Baldi and Peter Sadowski. The dropout learning algorithm. In Artificial Intelligence, 2014.

Leon Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks, 1998.

Leon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, vol. 3176, Springer Verlag, 2004.

Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy, 2021.

Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. In Drug Discovery Today, 2018a.

Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018b.

Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, 2020.

Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.

Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

William Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In arXiv, 2020.

Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. PyTorch-BigGraph: A large-scale graph embedding system. In The Conference on Systems and Machine Learning, 2019.

Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. In Transactions of the Association for Computational Linguistics, 2015.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2014.

Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In ACM International Conference on Web Search and Data Mining, 2018.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In Journal of Machine Learning Research, 2014.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019.

Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.

Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, 2019.

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 2018.

Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, 2016.

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representations, 2020.

Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Few-shot representation learning for out-of-vocabulary words. In Advances in Neural Information Processing Systems, 2019.

APPENDIX
A HYPERPARAMETERS
For the general link prediction tasks we used |B| = |V|, C = 5, f = 3, 10 negative samples per edge, and the Adam optimizer with a learning rate of 0.5, multiplied by a factor of 0.2 every 50 steps, for 200 total iterations. The differences are listed below. The Reddit dataset was trained using a starting learning rate of 2.0, decaying 50% every 10 iterations. The LiveJournal task was trained using a fixed learning rate of 0.001, |B| = 5000, f = 2, and 50 negative samples per edge.

For the node classification tasks: For F(SimpleGCN) on Amazon, we use f = [15, ], a batch size of 1024, and a learning rate of 0.02, decaying by a factor of 0.2 after 2 and 6 epochs, for a total of 25 epochs. On Reddit, it is the same except f = [25, ]. For F(SAGE) on Amazon we use f = [20, ], a two-layer model, a batch size of 256, and fixed learning rates of 0.001 and 0.002 respectively. On Reddit we use f = [25, ], a fixed learning rate of 0.001, a hidden dimension of 256 and a batch size of 1024. On the Products dataset, we used f = [15, ], a fixed learning rate of 0.001 and a batch size of 1024, a hidden dimension of 256 and a fixed learning rate of 0.003.

For GAT (baseline), we follow the authors' code and hyperparameters: for Cora and Citeseer, we use Adam with a learning rate of 0.005, L2 regularization of 0.0005, 8 attention heads on the first layer and 1 attention head on the output layer. For Pubmed, we use Adam with a learning rate of 0.01, L2 regularization of 0.01, 8 attention heads on the first layer and 8 attention heads on the output layer. For F(GAT), we use the same aforementioned hyperparameters, a fanout of 3 and a traversal depth of 2 (to cover two layers), i.e. F = [3, 3]. For F(GCN), we use the authors' recommended hyperparameters: a learning rate of 0.005, 0.001 L2 regularization, and F = [3, 3], for all datasets. For both methods, we apply "patience" and stop the training if the validation loss does not improve for 100 consecutive epochs, reporting the test accuracy at the best validation loss. For F(MixHop), we wrap the authors' script and use their hyperparameters. For F(GCNII), we use F = [5, , , , , , ], as their models are deep (64 layers for Cora). Otherwise, we inherit their network hyperparameters (latent dimensions, number of layers, dropout factor, and their introduced coefficients), as they have tuned them per dataset, but we change the learning rate to half of what they use, extend the patience from 100 to 1000, and extend the maximum number of epochs from 1500 to 5000. This is because we present a subgraph at each epoch, and therefore we intuitively want to slow down the learning per epoch, similar to the practice when applying Dropout to a neural network. We re-run their shell scripts, with their code modified to use the Rooted Adjacency, sampled at every epoch, rather than the real adjacency.

MLP was trained with 1 layer and a learning rate of 0.01.
B PROOFS

B.1 PROOF OF PROPOSITION 1

Proof.
E[T̂^k_{u,v}] = E[ (Σ_{i=1}^{f^k} t^{u,v,k}_i) / f^k ]
            = (Σ_{i=1}^{f^k} E[t^{u,v,k}_i]) / f^k
            = (Σ_{i=1}^{f^k} Pr[t^{u,v,k}_i = 1]) / f^k
            = (Σ_{i=1}^{f^k} T^k_{u,v}) / f^k
            = T^k_{u,v}.  ∎
B.2 PROOF OF PROPOSITION 2

Proof.
Var[T̂^k_{u,v}] = (Σ_{i=1}^{f^k} Var[t^{u,v,k}_i]) / (f^k)²
              = f^k T^k_{u,v}(1 − T^k_{u,v}) / (f^k)²
              = T^k_{u,v}(1 − T^k_{u,v}) / f^k.
Since 0 ≤ T^k_{u,v} ≤ 1, the product T^k_{u,v}(1 − T^k_{u,v}) is maximized at T^k_{u,v} = 1/2. Hence Var[T̂^k_{u,v}] ≤ 1/(4 f^k).  ∎
B.3 PROOF OF PROPOSITION 3

Proof. Consider a d-dimensional factorization of Σ_k c_k T^k, where the c_k are scalar coefficients:

𝓛 = (1/2) ‖ LR − Σ_k c_k T^k ‖²_F,    (7)

parametrized by L, R^⊤ ∈ R^{n×d}. The gradients of 𝓛 w.r.t. the parameters are:

∇_L 𝓛 = (LR − Σ_k c_k T^k) R^⊤   and   ∇_R 𝓛 = L^⊤ (LR − Σ_k c_k T^k).    (8)

Given the estimated objective 𝓛̂ (replacing T^k with the GTTF-estimated T̂^k of Equation 6):

𝓛̂ = (1/2) ‖ LR − Σ_k c_k T̂^k ‖²_F.    (9)

It follows that:

E[∇_L 𝓛̂] = E[(LR − Σ_k c_k T̂^k) R^⊤]
         = E[LR − Σ_k c_k T̂^k] R^⊤        (scaling property of expectation)
         = (LR − Σ_k c_k E[T̂^k]) R^⊤      (linearity of expectation)
         = (LR − Σ_k c_k T^k) R^⊤          (Proposition 1)
         = ∇_L 𝓛.

The above steps can similarly be used to show E[∇_R 𝓛̂] = ∇_R 𝓛.  ∎
B.4 PROOF OF PROPOSITION 4

Proof. We want to show that E[∇_Z L(T̂^k, Z)] = ∇_Z L(T^k, Z). Since the terms of L₁ are unaffected by T̂, they are excluded w.l.o.g. from L in the proof.

E[∇_Z L(T̂^k, Z)] = E[ ∇_Z ( − Σ_{u,v∈V} Σ_{k∈{1..C}} L₂(T̂^k, u, v) L₃(Z, u, v) ) ]
 = −∇_Z Σ_{u,v∈V} Σ_{k∈{1..C}} L₂(E[T̂^k], u, v) L₃(Z, u, v)    (L₂ is linear over T̂^k)
 = −∇_Z Σ_{u,v∈V} Σ_{k∈{1..C}} L₂(T^k, u, v) L₃(Z, u, v)    (by Prop. 1)
 = ∇_Z L(T^k, Z).  ∎

The following table gives the decomposition for DeepWalk, node2vec, and Watch Your Step (WYS). node2vec also introduces a biased sampling procedure based on hyperparameters (which they name p and q) instead of uniform transition probabilities. We can equivalently bias the transitions in GTTF to match node2vec's. This would then show up as a change in T̂^k in the objective. This effect can also be included in the objective by multiplying ⟨Z_u, Z_v⟩ by the probability of such a transition in L₃. In this format, the p and q variables appear in the objective and can be included in the optimization. For WYS, the Q_k are also trainable parameters.

Table 5: Decomposition of graph embedding methods to demonstrate unbiased learning. For WYS, Z_u = concatenate(L_u, R_u).

Method            L₁                                    L₂                     L₃
DeepWalk                                                ((C − k + 1)/C) T^k    log(⟨Z_u, Z_v⟩)
node2vec          log(Σ_{v∈V} exp(⟨Z_u, Z_v⟩))          ((C − k + 1)/C) T^k    ⟨Z_u, Z_v⟩
Watch Your Step   log(1 − σ(⟨L_u, R_v⟩))                Q_k T^k                log(σ(⟨L_u, R_v⟩))

For methods in which the transition distribution is not uniform, such as node2vec, there are two options for incorporating this distribution in the loss. The obvious choice is to sample from a biased transition matrix, T_{u,v} = W̃_{u,v}, where W̃ holds the transition weights. Alternatively, the transition bias can be used as a weight on the objective itself. This approach is still unbiased, as E_{v∼W̃_u}[L(v, u)] = Σ_{v∈V} Pr_{v∼W̃_u}[v] L(v, u) = Σ_{v∈V} W̃_{u,v} L(v, u).
B.5 PROOF OF PROPOSITION 5

Proof. Let Ã be the neighborhood patch returned by GTTF, and let a tilde (˜) indicate a measurement based on the sampled graph Ã, such as the degree vector δ̃ or the diagonal degree matrix D̃. For the remainder of this proof, let all notation for adjacency matrices, A or Ã, diagonal degree matrices, D or D̃, and the degree vector, δ, refer to the corresponding measure on the graph with self loops, e.g. A ← A + I_{n×n}. We now show that the expectation of the layer output is unbiased. Since E[Å^k H^(l−1) W^(l)] = E[Å^k] H^(l−1) W^(l), the output is unbiased if E[Å^k] = (D^{−1/2} A D^{−1/2})^k.

E[Å^k] = E[ D^{1/2} (D̃^{−1} Ã)^k D^{−1/2} ] = D^{1/2} E[ (D̃^{−1} Ã)^k ] D^{−1/2}.

Let P_{u,v,k} be the set of all walks {p = (u, v_1, ..., v_{k−1}, v) | v_i ∈ V}, and let p ∃ Ã indicate that the path p exists in the graph given by Ã. Let t_{u,v,k} indicate a transition from u to v in k steps (on the full graph), and t_p the event of a random walker traversing the graph along path p; tilded versions refer to walks on Ã. Then:

E[ (D̃^{−1} Ã)^k_{u,v} ] = E[ T̃^k_{u,v} ] = Pr[ t̃_{u,v,k} = 1 ]
 = Σ_{p∈P_{u,v,k}} Pr[p ∃ Ã] · Pr[t̃_p = 1 | p ∃ Ã]
 = Σ_{p∈P_{u,v,k}} ( Π_{i=1}^{k} 𝟙[A[p_i, p_{i+1}] = 1] · (f+1)/δ[p_i] ) · ( Π_{i=1}^{k} (f+1)^{−1} )
 = Σ_{p∈P_{u,v,k}} Π_{i=1}^{k} 𝟙[A[p_i, p_{i+1}] = 1] · δ[p_i]^{−1}
 = Σ_{p∈P_{u,v,k}} Pr[t_p = 1] = Pr[t_{u,v,k} = 1] = T^k_{u,v} = (D^{−1} A)^k_{u,v}.

Thus, E[Å^k] = (D^{−1/2} A D^{−1/2})^k and E[Å^k H^(l−1) W^(l)] = (D^{−1/2} A D^{−1/2})^k H^(l−1) W^(l).  ∎

For writing, we assumed nodes have degree δ_u ≥ f, though the proof still holds if that is not the case, as the probability of an outgoing edge being present from u becomes 1 and the transition probability becomes δ_u^{−1}, i.e. the same as no estimate at all.

B.6 PROOF OF PROPOSITION 6

We restrict our analysis to GCNs with linear activations, with output H = GCN_X(A; W) = T X W. (Footnote: this definition averages the node features, i.e. uses non-symmetric normalization, and appears in multiple GCNs including Hamilton et al. (2017). If not α-regular, it would be (1/|A|) Σ_{Ã∈A} Ã ∝ D^{−1} A.) We are interested in quantifying the change of H as A changes, and therefore the fixed (always visible) features X are placed on the subscript. Let Ã denote the adjacency accumulated by GTTF's ROOTEDADJACC (Eq. 1), and H̃_c = GCN_X(Ã_c). Let A = {Ã_c}_{c=1}^{|A|} denote the (countable) set of all adjacency matrices realizable by GTTF. For the analysis, assume the graph is α-regular: the assumption eases the notation though it is not needed. Therefore, degree δ_u = α for all u ∈ V. Our analysis depends on (1/|A|) Σ_{Ã∈A} Ã ∝ A, i.e. the average realizable matrix by GTTF is proportional (entry-wise) to the full adjacency. This can be shown by considering one row at a time: given node u with δ_u = α outgoing neighbors, each of its neighbors has the same appearance probability. Summing over all combinations C(δ_u, f) makes each edge appear with the same frequency, noting that |A| evenly divides C(δ_u, f) for all u ∈ V.

We define a dropout module:

d_A = Σ_{c=1}^{|A|} z_c Ã_c   with   z ∼ Categorical(1/|A|, 1/|A|, . . . , 1/|A|)  (|A| entries),    (10)

where z_c acts as a Multinoulli selector over the elements of A, with one of its entries set to 1 and all others set to zero. With this definition, GCNs can be seen in the dropout framework as H̃ = GCN_X(d_A). Nonetheless, in order to inherit the analysis of (Baldi & Sadowski, 2014, see their equations 140 & 141), we need to satisfy two conditions upon which their analysis is founded:
(i) E[GCN_X(d_A)] = GCN_X(A): in the usual (feature-wise) dropout, such a condition is easily verified.
(ii) The backpropagated error signal does not vary too much around the mean, across all realizations of d_A.
Condition (i) is satisfied due to the proof of Proposition 5. To analyze the error signal, i.e. the gradient of the error w.r.t. the network, assume the loss function 𝓛(H), outputting a scalar loss, is λ-Lipschitz continuous. Then, for 𝓛(H) and 𝓛(H̃):

‖∇_H 𝓛(H) − ∇_H 𝓛(H̃)‖²  ≤(a)  λ (∇_H 𝓛(H) − ∇_H 𝓛(H̃))^⊤ (H − H̃)    (11)
  ≤(b)  λ ‖∇_H 𝓛(H) − ∇_H 𝓛(H̃)‖ ‖H − H̃‖    (12)
  ≤  λ ‖∇_H 𝓛(H) − ∇_H 𝓛(H̃)‖ · Q √(Var[T̃]) ‖X‖ ‖W‖   with probability ≥ 1 − Q^{−2}    (13)
  ≤  (λ Q / (2√f)) ‖∇_H 𝓛(H) − ∇_H 𝓛(H̃)‖ ‖W‖ ‖X‖    (14)
  ⇒  ‖∇_H 𝓛(H) − ∇_H 𝓛(H̃)‖ ≤ (λ Q / (2√f)) ‖W‖ ‖X‖,    (15)

where (a) is by Lipschitz continuity, (b) is by the Cauchy-Schwarz inequality, (13) holds with the stated probability by Chebyshev's inequality, (14) uses the element-wise variance bound on T̃ from the proof of Prop. 2, and the last line follows by dividing both sides by the common term. This shows that one can make the error signal for the different realizations arbitrarily small, for example, by choosing a larger fanout value or putting (convex) norm constraints on W and X, e.g. through batchnorm and/or weightnorm. Since we can have ∇_H 𝓛(H) ≈ ∇_H 𝓛(H̃_1) ≈ ∇_H 𝓛(H̃_2) ≈ · · · ≈ ∇_H 𝓛(H̃_{|A|}) with high probability, the analysis of Baldi & Sadowski (2014) applies. Effectively, it can be thought of as an online learning algorithm where the elements of A are the stochastic training examples, analyzed per (Bottou, 1998; 2004), as explained by Baldi & Sadowski (2014).  ∎
B.7 PROOF OF PROPOSITION 7

Proof. CompactAdj occupies O(sizeof(δ) + sizeof(Â)) = O(n + m) storage. Moreover, for extremely large graphs, the adjacency can be row-wise partitioned across multiple machines, therefore admitting linear scaling. However, we acknowledge that choosing which rows to partition to which machines can drastically affect the performance. Balanced partitioning is ideal; it is an NP-hard problem, but many approximations have been proposed. Nonetheless, reducing inter-communication when distributing the data structure across machines is outside our scope.

B.8 PROOF OF PROPOSITION 8

Proof. The time complexity is O(b f^h). This follows trivially from the GTTF functional: each node in the batch (b of them) builds a tree with depth h and fanout f, i.e. with O(f^h) tree nodes. This calculation assumes random number generation, ACCUMULATEFN, and BIASFN take constant time. The searchsorted function is linear, as it is called on a sorted list: the cumulative sum of probabilities.

C ADDITIONAL GTTF IMPLEMENTATIONS
C.1 MESSAGE PASSING IMPLEMENTATIONS

C.1.1 GRAPH ATTENTION NETWORKS (GAT, Veličković et al., 2018)

One can implement GAT by following the previous subsection, utilizing ACCUMULATEFN and BIASFN defined in (1) and (2), but replacing the model (3) by GAT's:

GAT(Ã, X; A, W_0, W_1) = softmax((A ◦ Å) ReLU((A ◦ Å) X W_0) W_1);    (16)

where ◦ is the Hadamard product and A (the attention matrix) is an n × n matrix placing a positive scalar (an attention value) on each edge, parametrized by the multi-headed attention described in (Veličković et al., 2018). However, for some high-degree nodes that put most of the attention weight on a small subset of their neighbors, sampling uniformly (with BIASFN = NOREVISITBIAS) might mostly sample neighbors with entries in A of value ≈ 0, and could require more epochs for convergence. However, our flexible functional allows us to propose a sample-efficient alternative that is, in expectation, equivalent to the above:

GAT(Ã, X; A, W_0, W_1) = softmax((√A ◦ Å) ReLU((√A ◦ Å) X W_0) W_1);    (17)

def GATBIAS(T, u): return NOREVISITBIAS(T, u) ◦ √(A[u, Â[u]]);    (18)
C.1.2 DEEP GRAPH INFOMAX (DGI, Veličković et al., 2019)

A DGI implementation on GTTF can use ACCUMULATEFN = ROOTEDADJACC, defined in (1). To create the positive graph, it can sample some nodes B ⊂ V, pass B to GTTF's Traverse, and utilize the accumulated adjacency Ã for running GCN(Ã, X_B) and GCN(Ã, X_permute), where the second run randomly permutes the order of nodes in X. Finally, the outputs of those GCNs can be fed into a readout function, whose output goes to a discriminator trying to classify whether the readout latent vector corresponds to the real or the permuted features.

C.2 NODE EMBEDDING IMPLEMENTATIONS
C.2.1 NODE2VEC (Grover & Leskovec, 2016)

A simple implementation follows from above: N2VACC ≜ DEEPWALKACC; but override BIASFN =

def N2VBIAS(T, u): return p^{−𝟙[u = T_{−2}]} · q^{−𝟙[⟨A[T_{−1}], A[u]⟩ > 0]};    (19)

where 𝟙 denotes the indicator function, and p, q > 0 are hyperparameters of node2vec assigning (un-normalized) probabilities for transitioning back to the previous node or to a node connected to it. ⟨A[T_{−1}], A[u]⟩ counts mutual neighbors between the considered node u and the previous node T_{−1}.

An alternative implementation is to not override BIASFN but rather fold it into ACCUMULATEFN, as:

def N2VACC(T, u, f): DEEPWALKACC(T, u, f);  η[u] ← η[u] × N2VBIAS(T, u);    (20)

Both alternatives are equivalent in expectation. However, the latter directly exposes the parameters p and q to the objective L, allowing them to be differentiable w.r.t. L and therefore trainable via gradient descent, rather than by grid-search. Nonetheless, parameterizing p and q is beyond our scope.

C.2.2 WATCH YOUR STEP (WYS, Abu-El-Haija et al., 2018)
First, embedding dictionaries R, L ∈ R^{n×d} can be initialized to random. Then, repeatedly over batches B ⊆ V, the loss 𝓛 can be initialized to estimate the negative part of the objective:

𝓛 ← − Σ_{u∈B} log σ( − E_{v∼U(V)}[ ⟨R_u, L_v⟩ + ⟨R_v, L_u⟩ ] ).

Then call GTTF's Traverse, passing the following ACCUMULATEFN =

def WYSACC(T, u):
    if T.size() ≠ Q.size(): return
    t ← T[0];  U ← T[1:] ∪ [u]
    ctx_weighted_L ← Σ_j Q_j L_{U_j};  ctx_weighted_R ← Σ_j Q_j R_{U_j}
    𝓛 ← 𝓛 − log( σ( ⟨R_t, ctx_weighted_L⟩ + ⟨L_t, ctx_weighted_R⟩ ) )

D MISCELLANEOUS

D.1 SENSITIVITY
The following figures show the sensitivity of fanout and walk depth for WYS on the Reddit dataset.

Figure 3: Test AUC score when changing the fanout (left) and the random walk length (right).
Figure 4: Runtime of GTTF versus PyG for training GCN on citation network datasets (Cora, Citeseer, Pubmed). Each line averages 10 runs on an Nvidia GTX 1080Ti GPU.

D.2 RUNTIME OF GCN