Constant Time Graph Neural Networks
Ryoma Sato
Makoto Yamada
Hisashi Kashima
Abstract
The recent advancements in graph neural networks (GNNs) have led to state-of-the-art performances in various applications, including chemo-informatics, question-answering systems, and recommender systems. However, scaling up these methods to huge graphs, such as social networks and Web graphs, remains a challenge. In particular, the existing methods for accelerating GNNs either are not theoretically guaranteed in terms of the approximation error or incur at least a linear time computation cost. In this study, we reveal the query complexity of the uniform node sampling scheme for Message Passing Neural Networks, including GraphSAGE, graph attention networks (GATs), and graph convolutional networks (GCNs). Surprisingly, our analysis reveals that the complexity of the node sampling method is completely independent of the number of the nodes, edges, and neighbors of the input and depends only on the error tolerance and the confidence probability, while providing a theoretical guarantee for the approximation error. To the best of our knowledge, this is the first paper to provide a theoretical guarantee of approximation for GNNs within constant time. Through experiments with synthetic and real-world datasets, we investigate the speed and precision of the node sampling scheme and validate our theoretical results.
1. Introduction
Machine learning on graph structures has various applications, such as chemo-informatics (Gilmer et al., 2017; Zhang et al., 2018), question answering systems (Schlichtkrull et al., 2018; Park et al., 2019), and recommender systems (Ying et al., 2018; Wang et al., 2019a;b; Fan et al., 2019). Recently, a novel machine learning model for graph data called graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2009) demonstrated state-of-the-art performances in various graph learning tasks. However, large scale graphs, such as social networks and Web graphs, contain billions of nodes, and even a linear time computation cost per iteration is prohibitive. Therefore, the application of GNNs to huge graphs presents a challenge. Although Ying et al. (2018) succeeded in applying GNNs to a Web-scale network using MapReduce, their method still requires massive computational resources.

There are several node sampling techniques for reducing GNN computation. For example, an empirical neighbor sampling scheme is used to speed up GraphSAGE (Hamilton et al., 2017). FastGCN employs a random layer-wise node sampling (Chen et al., 2018a). Huang et al. (2018) further improved FastGCN by using an adaptive sampling technique to reduce the variance of estimators. Chen et al. (2018b) proposed a variant of neighbor sampling, which uses historical activations to reduce the estimator variance. ClusterGCN (Chiang et al., 2019) clusters nodes into dense blocks and aggregates features within each block. LADIES (Zou et al., 2019) combines layer-wise sampling with neighbor sampling to improve efficiency. Overall, the existing sampling techniques for GNNs are effective in practice. However, these techniques either are not theoretically guaranteed in terms of approximation error or incur at least a linear time computation cost.

In this study, we considered the problem of approximating the embedding of one node using GNNs in constant time with maximum precision. Consider, for example, the problem of predicting whether a user of an SNS clicks an advertisement using GNNs in real time (i.e., when the user accesses the SNS): a user may have many neighbors, but the GNN must respond within a limited time, which prohibits exact computation. This motivated us to approximate the exact computation in limited time with a theoretical guarantee. We analyzed the neighbor sampling technique (Hamilton et al., 2017) to show that only a constant number of samples is needed to guarantee the approximation error for Message Passing Neural Networks (Gilmer et al., 2017), including GraphSAGE (Hamilton et al., 2017), GATs (Veličković et al., 2018), and GCNs (Kipf & Welling, 2017). It should be noted that the neighbor sampling technique was originally introduced as a heuristic, and no theoretical guarantees were provided. Specifically, we prove PAC learning-like bounds on the approximation errors of neighbor sampling. Given an error tolerance ε and a confidence probability 1 − δ, our analysis shows that an estimate ẑ_v of the exact embedding z_v of a node v, such that Pr[‖ẑ_v − z_v‖ ≥ ε] ≤ δ, and an estimate of the exact gradient ∂z_v/∂θ of the embedding z_v with respect to the network parameters θ, accurate in the same sense under the Frobenius norm, can be computed in constant time.
In particular, uniform node sampling can approximate the exact embedding and its gradients within O(ε^{-2L} (log(1/ε) + log(1/δ))^{L-1} log(1/δ)) time, where L denotes the number of layers. This complexity is completely independent of the number of the nodes, edges, and neighbors of the input, which enables us to deal with graphs irrespective of their size. Moreover, the complexity is polynomial with respect to 1/ε and log(1/δ). We demonstrate that the time complexity is optimal when L = 1 with respect to the error tolerance ε. Our analysis can be applied to the prediction setting by considering the prediction problem as a one-dimensional embedding in [0, 1]. In that case, the prediction with the approximation is correct if the exact computation predicts correctly with margin ε.

In addition to presenting positive results, we show that some existing GNNs, including the original GraphSAGE (Hamilton et al., 2017), cannot be approximated in constant time by any algorithm. Our theoretical results not only provide guarantees on approximation errors but also reveal which information each GNN model gives importance to in the light of computational complexity. The GNN architectures that can be approximated in constant time do not use the fine-grained information but only the coarse information of the input graph, whereas the GNN architectures that cannot be approximated in constant time use all the information of the input graph. These observations provide theoretical characteristics of GNN architectures.

We experimentally show the validity of our theories, including the negative results, and we show that the approximation error between the exact computation and its approximation rapidly converges to zero when possible. To the best of our knowledge, this is the first analysis concerning constant time approximation for GNNs with a theoretical guarantee.

Contributions:
• We analyze the neighbor sampling technique for GraphSAGE, graph attention networks (GATs), and GCNs to provide theoretical justification. In particular, our analysis shows that the complexity is completely independent of the number of nodes, edges, and neighbors of the input.
• We show that some existing GNNs, including the original GraphSAGE (Hamilton et al., 2017), cannot be approximated in constant time by any algorithm (see Table 1 for details).
• We empirically validate our theorems using synthetic and real-world datasets.
2. Related Work
Originally, GNN-like models were proposed in the chemistry field (Sperduti & Starita, 1997; Baskin et al., 1997). Sperduti & Starita (1997) applied a linear aggregation operation and a non-linear activation function recursively, and Baskin et al. (1997) utilized parameter sharing to model invariant transformations on node and edge features. These features are common in modern GNNs. Gori et al. (2005) and Scarselli et al. (2009) proposed novel graph learning models and named them graph neural networks (GNNs). Their models recursively apply the propagation function until convergence to obtain node embeddings. Li et al. (2016) improved these models to create gated graph neural networks by introducing LSTM and GRU units. Molecular graph networks (Merkwirth & Lengauer, 2005) are a concurrent model of the GNNs (Gori et al., 2005) with a similar architecture, but they use a constant number of layers. Duvenaud et al. (2015) constructed a GNN model based on circular fingerprints. Bruna et al. (2014) and Defferrard et al. (2016) took advantage of graph spectral analysis and graph signal processing to construct GNN models. Kipf & Welling (2017) proposed GCNs, which significantly outperform the existing methods, including non-neural network-based approaches. They approximated a spectral model by linear functions with respect to the graph Laplacian, thereby reducing spectral models to spatial models. Gilmer et al. (2017) proposed message passing neural networks (MPNNs), a general framework of GNNs using the message passing mechanism. GATs (Veličković et al., 2018) greatly improve the performance of GNNs by incorporating the attention mechanism. With the advent of GATs, various GNN models with the attention mechanism have been proposed (Wang et al., 2019a; Park et al., 2019). Recently, the expressive power of GNNs has been revealed theoretically. For example, Xu et al. (2019) and Morris et al. (2019) showed that the expressive power of message passing GNNs is at most the power of the Weisfeiler-Lehman (WL) test (Weisfeiler & Lehman, 1968). Sato et al. (2019) showed that the set of problems that message passing GNNs can solve is the same as that which can be solved by distributed local algorithms (Angluin, 1980; Suomela, 2013).

GraphSAGE (Hamilton et al., 2017) is an additional GNN model, which employs neighbor sampling to reduce the computational costs of training and inference.

Table 1. ✓ indicates that neighbor sampling approximates the network in constant time. ✗ indicates that no algorithm can approximate the network in constant time. ✓ in the Gradient column indicates that the error between the gradient of the approximated embedding and that of the exact embedding is also theoretically bounded. ✓* requires an additional condition (Assumption 6) to approximate it in constant time.

                          GATs, GraphSAGE-GCN, GraphSAGE-mean    GraphSAGE-pool    GCNs
  Activation              Embedding     Gradient                 Embedding         Embedding      Gradient
  sigmoid / tanh          ✓ (Thm. 1)    ✓ (Thm. 4)               ✗ (Prop. 9)       ✓* (Thm. 1)    ✓* (Thm. 4)
  ReLU                    ✓ (Thm. 1)    ✗ (Prop. 8)              ✗ (Prop. 9)       ✓* (Thm. 1)    ✗ (Prop. 8)
  ReLU + normalization    ✗ (Prop. 7)   ✗ (Prop. 7)              ✗ (Prop. 9)       ✗ (Prop. 7)    ✗ (Prop. 7)
Neighbor sampling enables GraphSAGE to deal with large graphs. However, neighbor sampling was introduced without any theoretical guarantee, and the number of samples is chosen empirically. An alternative computationally efficient GNN is FastGCN (Chen et al., 2018a), which employs layer-wise random node sampling to speed up training and inference. Huang et al. (2018) further improved FastGCN by using an adaptive node sampling technique to reduce the variance of estimators. By virtue of the adaptive sampling technique, it reduces the computational costs and outperforms neighbor sampling in terms of classification accuracy and convergence speed. Chen et al. (2018b) proposed an alternative neighbor sampling technique, which uses historical activations to reduce the estimator variance. Additionally, it can achieve zero variance after a certain number of iterations. However, because it uses the same sampling technique as GraphSAGE to obtain the initial solution, the approximation error is not theoretically bounded until the Ω(n)-th iteration. ClusterGCN (Chiang et al., 2019) first clusters nodes into dense blocks and then aggregates node features within each block. LADIES (Zou et al., 2019) samples neighboring nodes using importance sampling for each layer to improve efficiency while keeping the variance small. However, neither LADIES (Zou et al., 2019) nor ClusterGCN (Chiang et al., 2019) has a theoretical guarantee on the approximation error. Overall, the existing sampling techniques are effective in practice. However, they either are not theoretically guaranteed in terms of approximation error or incur at least a linear time computation cost to calculate the embedding of a node and its gradients.

Sublinear time algorithms were originally proposed for property testing (Rubinfeld & Sudan, 1996). Sublinear property testing algorithms determine whether the input has some property π or the input is sufficiently far from property π with high probability in sublinear time with respect to the input size. Sublinear time approximation algorithms are an additional type of sublinear time algorithms. More specifically, they calculate a value sufficiently close to the exact value with high probability in sublinear time. Constant time algorithms are a subclass of sublinear time algorithms. They operate not only in sublinear time with respect to the input size but also in constant time. The proposed algorithm is classified as a constant time approximation algorithm. Examples of sublinear time approximation algorithms include the minimum spanning tree in metric spaces (Czumaj & Sohler, 2004) and the minimum spanning tree with integer weights (Chazelle et al., 2005). Parnas & Ron (2007) proposed a method to convert distributed local algorithms into constant time approximation algorithms. In their paper, they constructed constant time algorithms for the minimum vertex cover problem and the dominating set problem. Nguyen & Onak (2008) and Yoshida et al. (2009) improved the complexities of these algorithms. Classic examples of sublinear time algorithms related to machine learning include clustering (Indyk, 1999; Mishra et al., 2001).
Examples of recent studies in this stream include constant time approximation of the minimum value of quadratic functions (Hayashi & Yoshida, 2016) and constant time approximation of the residual error of the Tucker decomposition (Hayashi & Yoshida, 2017). Hayashi and Yoshida adopted simple sampling strategies to obtain theoretical guarantees, as we do in our study. In this paper, we provide a theoretical guarantee for the approximation of GNNs within constant time for the first time.
3. Background
Let G = (V, E) be the input graph, V = {1, 2, ..., n} the set of nodes, n = |V| the number of nodes, E the set of edges, m = |E| the number of edges, deg(v) the degree of node v, N(v) the set of neighbors of a node v, x_v ∈ R^d the feature vector associated with a node v ∈ V, and X = (x_1, x_2, ..., x_n)^⊤ ∈ R^{n×d} the stacked feature vectors, where ⊤ denotes the matrix transpose.

Algorithm 1 O_z: Exact embedding
Require: Graph G = (V, E); features X ∈ R^{n×d}; node index v ∈ V; model parameters θ.
Ensure: Exact embedding z_v.
  z_i^{(0)} ← x_i (∀i ∈ V)
  for l ∈ {1, ..., L} do
    for i ∈ V do
      h_i^{(l)} ← Σ_{u∈N(i)} M_{liu}(z_i^{(l-1)}, z_u^{(l-1)}, e_{iu}, θ)
      z_i^{(l)} ← U_l(z_i^{(l-1)}, h_i^{(l)}, θ)
    end for
  end for
  return z_v^{(L)}

We consider the node embedding problem using GNNs in the MPNN framework (Gilmer et al., 2017). This framework includes many GNN models, such as GraphSAGE and GCNs. Algorithm 1 shows the algorithm of MPNNs. We refer to z_v^{(L)} as z_v for the sake of simplicity. The aim of this study is to show that it is possible to approximate the embedding vector z_v and the gradients ∂z_v/∂θ in constant time with the given model parameters θ and node v.

In the design of constant time algorithms, we need to specify a means by which they can access the input, because they cannot read the entire input. We follow the standard convention of sublinear time algorithms (Parnas & Ron, 2007; Nguyen & Onak, 2008). We model an algorithm as an oracle machine that can generate queries regarding the input, and we measure the complexity by query complexity. Algorithms can access the input only by querying the following oracles: (1) O_deg(v): the degree of node v, (2) O_G(v, i): the i-th neighbor of node v, and (3) O_feature(v): the feature of node v. We assume that an algorithm can query the oracles in constant time per query.

In this paper, we aim to compute an embedding of a single node in constant time because it is useful in online settings, as we introduced with the real-time advertisement example in the introduction. In some applications, computing the embeddings of all nodes is required. In such cases, the embeddings can be computed in O(n) time if each embedding can be computed in constant time, while the exact computation requires at least O(m) time, which is large if the input graph is dense.

Formally, given a node v, we compute the following functions with the least number of oracle accesses: (1) O_z(v): the embedding z_v and (2) O_g(v): the gradients ∂z_v/∂θ with respect to the parameters. However, the exact computation of O_z and O_g requires at least deg(v) queries to aggregate the features from the neighboring nodes, which is not constant time with respect to the input size. Thus, it is computationally expensive to execute the algorithm for a very large network, which motivates us to make the following approximations:
• Ô_z(v, ε, δ): an estimate ẑ_v of z_v such that Pr[‖ẑ_v − z_v‖ ≥ ε] ≤ δ,
• Ô_g(v, ε, δ): an estimate of ∂z_v/∂θ whose Frobenius-norm error exceeds ε with probability at most δ,
where ε > 0 is the error tolerance, 1 − δ is the confidence probability, and ‖·‖ and ‖·‖_F are the Euclidean and Frobenius norms, respectively.

Under a fixed model structure (i.e., the number of layers L, the message passing functions, and the update functions), we construct an algorithm that calculates Ô_z and Ô_g in constant time irrespective of the number of the nodes, edges, and neighbors of the input.

However, it is impossible to construct a constant time algorithm without any assumption on the inputs, as shown in Section 5. Therefore, we make the following mild assumptions.

Assumption 1 ∃B ∈ R s.t. ‖x_i‖ ≤ B, ‖e_{iu}‖ ≤ B, and ‖θ‖ ≤ B.

Assumption 2 deg(i) M_{liu} and U_l are uniformly continuous in any bounded domain.
Assumption 3 (Only for gradient computation) deg(i) DM_{liu} and DU_l are uniformly continuous in any bounded domain, where D denotes the Jacobian operator.

Intuitively, the first assumption is needed because the absolute error cannot be bounded if the scale of the inputs becomes infinitely large. The second and third assumptions are needed to prohibit the amplification of errors in the message passing phase. Furthermore, we derive the query complexity of neighbor sampling when the message functions and update functions satisfy the following assumptions.

Assumption 4 ∃K ∈ R s.t. deg(i) M_{liu} and U_l are K-Lipschitz continuous in any bounded domain.

Assumption 5 (Only for gradient computation) ∃K' ∈ R s.t. deg(i) DM_{liu} and DU_l are K'-Lipschitz continuous in any bounded domain.
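To make the oracle model concrete, the sketch below implements the three oracles and Algorithm 1 on top of them. It is a minimal illustration under our own simplifying assumptions (adjacency lists back the oracles, and the edge features and θ are folded into the message/update callables), not the authors' implementation.

```python
import numpy as np

class GraphOracle:
    """Oracle access to the input: O_deg, O_G, and O_feature."""
    def __init__(self, adj, features):
        self.adj = adj            # adj[v]: list of neighbors of node v
        self.features = features  # features[v]: feature vector x_v

    def deg(self, v):             # O_deg(v): degree of node v
        return len(self.adj[v])

    def neighbor(self, v, i):     # O_G(v, i): the i-th neighbor of v
        return self.adj[v][i]

    def feature(self, v):         # O_feature(v): feature of node v
        return self.features[v]

def exact_embedding(oracle, v, L, message, update):
    """Algorithm 1: exact L-layer MPNN embedding z_v.

    message(l, z_i, z_u, deg_i) and update(l, z_i, h_i) play the roles
    of M_liu and U_l. Note that this touches every node and every edge,
    i.e., it is NOT a constant time computation.
    """
    n = len(oracle.adj)
    z = [oracle.feature(i) for i in range(n)]
    for l in range(1, L + 1):
        h = [sum(message(l, z[i], z[oracle.neighbor(i, j)], oracle.deg(i))
                 for j in range(oracle.deg(i)))
             for i in range(n)]
        z = [update(l, z[i], h[i]) for i in range(n)]
    return z[v]
```

For instance, GraphSAGE-GCN (Appendix A) corresponds to message = lambda l, zi, zu, d: zu / d and update = lambda l, zi, h: sigma(W[l] @ h), with the self-loop added to adj.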
4. Main Results
Algorithm 2 Ô_z^{(l)}: Estimate the embedding z_v^{(l)}
Require: Graph G = (V, E) (as oracle); features X ∈ R^{n×d} (as oracle); node index v ∈ V; model parameters θ; error tolerance ε; confidence probability 1 − δ.
Ensure: Approximation of the embedding z_v^{(l)}.
  S^{(l)} ← sample r^{(l)}(ε, δ) neighbors of v uniformly at random with replacement
  ĥ_v^{(l)} ← (O_deg(v)/r^{(l)}) Σ_{u∈S^{(l)}} M_{lvu}(O_feature(v), O_feature(u), e_{vu}, θ)   (if l = 1)
  ĥ_v^{(l)} ← (O_deg(v)/r^{(l)}) Σ_{u∈S^{(l)}} M_{lvu}(Ô_z^{(l-1)}(v ← v), Ô_z^{(l-1)}(v ← u), e_{vu}, θ)   (if l > 1)
  ẑ_v^{(l)} ← U_l(Ô_z^{(l-1)}(v ← v), ĥ_v^{(l)}, θ) if l > 1, otherwise U_l(O_feature(v), ĥ_v^{(l)}, θ)
  return ẑ_v^{(l)}

Now, we describe the construction of a constant time approximation algorithm based on neighbor sampling, which approximates the embedding z_v with an absolute error of at most ε and probability 1 − δ. We construct the algorithm recursively, layer by layer, by sampling r^{(l)} neighboring nodes in layer l. We refer to the algorithm that calculates the estimate of the embeddings in the l-th layer z^{(l)} as Ô_z^{(l)} (l = 1, ..., L). Algorithm 2 presents the pseudo code. Here, Ô_z^{(l-1)}(v ← u) represents calling the function Ô_z^{(l-1)} with the same parameters as the current function, except for v, which is replaced by u. In the following, we demonstrate the theoretical properties of Algorithm 2.
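A direct transcription of Algorithm 2 into Python follows; it reuses the hypothetical GraphOracle and callable signatures from the sketch of Algorithm 1 above and leaves the sample sizes r[l] as inputs (Theorem 2 below gives sufficient choices).

```python
import random

def estimate_embedding(oracle, v, l, r, message, update):
    """Algorithm 2: estimate z_v^{(l)} by recursive neighbor sampling.

    r[l] is the number of neighbors sampled at layer l; the total
    number of oracle queries depends only on r, not on the graph.
    """
    if l == 0:
        return oracle.feature(v)  # base of the recursion: raw features
    d = oracle.deg(v)
    # S^{(l)}: r[l] neighbors drawn uniformly at random WITH replacement.
    samples = [oracle.neighbor(v, random.randrange(d)) for _ in range(r[l])]
    z_v = estimate_embedding(oracle, v, l - 1, r, message, update)
    h_v = (d / r[l]) * sum(
        message(l, z_v,
                estimate_embedding(oracle, u, l - 1, r, message, update), d)
        for u in samples)
    return update(l, z_v, h_v)
```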
Theorem 1. For all ε > 0 and 1 > δ > 0, there exist r^{(l)}(ε, δ) (l = 1, ..., L) such that for all inputs satisfying Assumptions 1 and 2, the following property holds true: Pr[‖Ô_z(v, ε, δ) − z_v‖ ≥ ε] ≤ δ.

Theorem 1 shows that the approximation error of Algorithm 2 is bounded by ε with probability 1 − δ. It is proved by applying Hoeffding's inequality (Hoeffding, 1963) recursively. The complete proof is provided in the appendices. Because the number of sampled nodes depends only on ε and δ and is independent of the number of the nodes, edges, and neighbors of the input, Algorithm 2 operates in constant time. We provide the complexity when the functions are Lipschitz continuous.
Theorem 2. Under Assumptions 1 and 4, r^{(L)} = O(ε^{-2} log(1/δ)) and r^{(1)}, ..., r^{(L-1)} = O(ε^{-2} (log(1/ε) + log(1/δ))) are sufficient, and the query complexity of Algorithm 2 is O(ε^{-2L} (log(1/ε) + log(1/δ))^{L-1} log(1/δ)).

The total query complexity is, up to constants, the product of the per-layer sample counts, because each of the neighbors sampled at layer l triggers a recursive call at layer l − 1. Although the complexity is exponential with respect to the number of layers, this is nonetheless acceptable, because the number of layers is usually small in practice. For example, the original GCN employs two layers (Kipf & Welling, 2017). It is noteworthy that, although most constant time algorithms proposed in the literature also depend exponentially on some parameters, they have nonetheless been proved to be effective. For example, the constant time algorithms of Yoshida et al. (2009) for the maximum matching problem and the minimum set cover problem use numbers of queries that are exponential in 1/ε and in the structural parameters d, s, and t, where d is the maximum degree, s is the number of elements in a set, and each element is in at most t sets. The important point here is that the complexity is completely independent of the size of the input, which is desirable, especially when the input size is very large. In addition, we show that the query complexity of Algorithm 2 is optimal with respect to ε if the number of layers is one. In other words, a one-layer model cannot be approximated in o(ε^{-2}) time by any algorithm.
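To see the input-size independence concretely, the sample sizes of Theorem 2 can be spelled out as follows. The helper and the constant c are our own placeholders; the theorem fixes only the asymptotic orders, and the hidden constants depend on B, K, and L.

```python
import math

def sufficient_sample_sizes(eps, delta, L, c=1.0):
    """Per-layer sample counts of the orders given by Theorem 2.

    r[L]      = O(eps^-2 * log(1/delta))
    r[1..L-1] = O(eps^-2 * (log(1/eps) + log(1/delta)))
    No graph quantity (n, m, or degrees) appears anywhere.
    """
    r = {L: math.ceil(c / eps**2 * math.log(1 / delta))}
    for l in range(1, L):
        r[l] = math.ceil(c / eps**2 * (math.log(1 / eps) + math.log(1 / delta)))
    return r

# For a two-layer model the total query count is roughly r[1] * r[2],
# matching the O(eps^-2L (log 1/eps + log 1/delta)^{L-1} log 1/delta) bound.
print(sufficient_sample_sizes(eps=0.1, delta=0.01, L=2))
```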
Theorem 3. Under Assumptions 1 and 4 and L = 1, the time complexity of Algorithm 2 in Theorem 2 is optimal with respect to the error tolerance ε.

The proof is based on a lemma by Chazelle et al. (2005). The optimality when L ≥ 2 is an open problem.

Now, we show that the neighbor sampling technique can also guarantee the approximation errors of the gradients of embeddings with respect to the model parameters. Let ∂z_v/∂θ be the gradient of the embedding z_v with respect to the model parameters θ, i.e., (∂z_v/∂θ)_{ijk} = ∂z_{vi}/∂θ_{jk}.
Theorem 4. For all ε > 0 and 1 > δ > 0, there exist r^{(l)}(ε, δ) (l = 1, ..., L) such that for all inputs satisfying Assumptions 1, 2, and 3, the following property holds true: Pr[‖∂ẑ_v^{(L)}/∂θ − ∂z_v^{(L)}/∂θ‖_F ≥ ε] ≤ δ, where ∂ẑ_v^{(L)}/∂θ is the gradient of ẑ_v^{(L)}, which is obtained by Ô_z^{(L)}(v, ε, δ), with respect to θ.

Therefore, an estimate of the gradient of the embedding with respect to the parameters with an absolute error of at most ε and probability 1 − δ can be calculated by running Ô_z^{(L)}(v, ε, δ) and calculating the gradient of the obtained estimate of the embedding. We provide the complexity when the functions are Lipschitz continuous.
Theorem 5. Under Assumptions 1, 4, and 5, r^{(L)} = O(ε^{-2} log(1/δ)) and r^{(1)}, ..., r^{(L-1)} = O(ε^{-2} (log(1/ε) + log(1/δ))) are sufficient, and the gradient of the embedding with respect to the parameters can be approximated with an absolute error of at most ε and probability 1 − δ in O(ε^{-2L} (log(1/ε) + log(1/δ))^{L-1} log(1/δ)) time.

Note that, technically, MPNNs do not include GATs, because these networks use the embeddings of other neighboring nodes to calculate the attention values. However, Theorems 1 and 4 can be naturally extended to GATs, and we can approximate GATs in constant time by neighbor sampling. This is surprising, because GATs may pay considerable attention to some nodes, and uniform sampling may fail to sample these nodes. However, it can be shown that our assumptions, which do not seem related to this issue, prohibit this situation. The details are described in the appendices.
5. Inapproximability
In this section, we show that some existing GNNs cannot be approximated in constant time. The propositions state that these models cannot be approximated in constant time by either neighbor sampling or any other algorithm. In other words, for any algorithm that operates in constant time, there exist an error tolerance ε, a confidence probability 1 − δ, and a counter example input such that the approximation error for the input is more than ε with probability δ. This indicates that the application of an approximation method to these models requires close supervision, because the obtained embedding may be significantly different from the exact embedding.

Proposition 6. If ‖x_i‖ or ‖θ‖_F is not bounded, even under Assumption 2, the embeddings of GraphSAGE-GCN cannot be approximated with arbitrary precision and probability in constant time.

Proposition 6 indicates that Assumption 1 is necessary for constant-time approximation.
Proposition 7.
Even under Assumption 1, the embeddings and gradients of GraphSAGE-GCN with ReLU activation and normalization cannot be approximated with arbitrary precision and probability in constant time.
In other words, Proposition 7 shows that the original model of GraphSAGE-GCN cannot be approximated within constant time. We confirm this fact through computational experiments in Section 6 (Q2).
Proposition 8.
Even under Assumptions 1 and 2, the gradients of GraphSAGE-GCN with ReLU activation cannot be approximated with arbitrary precision and probability in constant time.
This indicates that the activation function is important for constant-time approximability, because the gradient of GraphSAGE-GCN with sigmoid activation can be approximated within constant time by neighbor sampling (Theorem 4). We confirm this fact through computational experiments in Section 6 (Q3). The following two propositions state that GraphSAGE-pool and GCNs cannot be approximated in constant time even under Assumptions 1, 2, and 3.
Proposition 9.
Even under Assumptions 1, 2, and 3, the embeddings of GraphSAGE-pool cannot be approximated with arbitrary precision and probability in constant time.
Proposition 10.
Even under Assumptions 1, 2, and 3, the embeddings of GCNs cannot be approximated with arbitrary precision and probability in constant time.
Constant Time Approximation for GCNs:
Because of the inapproximability results above, GCNs cannot be approximated in constant time in general. However, GCNs can be approximated in constant time if the input graph satisfies the following property.
Assumption 6
There exists a constant C ∈ R such that, for any input graph G = (V, E) and nodes v, u ∈ V, the ratio of deg(v) to deg(u) is at most C (i.e., deg(v)/deg(u) ≤ C).

Assumption 6 prohibits input graphs that have a skewed degree distribution. GCNs require Assumption 6 because, without it, the norm of the embedding is not bounded and the influence of anomaly nodes with low degrees is significant. It should be noted that GraphSAGE-pool cannot be approximated in constant time even under Assumption 6. We confirm this fact through computational experiments in Section 6 (Q5).

These negative results reveal which information each GNN model gives importance to in the light of computational complexity. For example, GCNs cannot be approximated in constant time when the input graph contains a high-degree node that neighbors anomaly low-degree nodes, and GCNs require the information of the anomaly nodes in addition to the majority nodes to be approximated accurately. This indicates that the exact computation of GCNs gives importance to these anomaly nodes as well as the majority nodes. In contrast, GraphSAGE-GCN ignores these anomaly nodes and takes only the majority nodes into account even in the exact computation. This is why GraphSAGE-GCN can be approximated accurately with only a constant number of samples. This observation suggests that GCNs should be used when the anomaly nodes play important roles in the problem considered; otherwise, GraphSAGE-GCN should be used, because GraphSAGE-GCN can infer faster than GCNs by neighbor sampling. This type of observation can be applied to other models, such as GraphSAGE-pool and GATs, as well.
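Given degree access, Assumption 6 reduces to a bound on the max/min degree ratio, which can be checked in one pass; the helper below is our own illustration, not part of the paper.

```python
def satisfies_assumption_6(degrees, C):
    """Assumption 6: deg(v) / deg(u) <= C for all pairs of nodes,
    which is equivalent to max(degrees) / min(degrees) <= C."""
    return max(degrees) / min(degrees) <= C

# A star graph violates the assumption for any moderate C,
# while a regular graph satisfies it with C = 1.
print(satisfies_assumption_6([99, 1, 1, 1], C=10))  # False
print(satisfies_assumption_6([3, 3, 3, 3], C=1))    # True
```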
6. Experiments
In this study, we focused on validating our theoretical results through numerical experiments, because Hamilton et al. (2017) have already reported the effect of neighbor sampling on downstream machine learning tasks (e.g., classification). Namely, we answer the following questions through experiments.
Q1:
How fast is the constant time approximation algorithm (Algorithm 2)?
Q2:
Does neighbor sampling (Algorithm 2) accurately approximate the embeddings of GraphSAGE-GCN without normalization (Theorem 1), whereas it cannot approximate the original one (Proposition 7)?
Q3:
Does neighbor sampling (Algorithm 2) accurately approximate the gradients of GraphSAGE-GCN with sigmoid activation (Theorem 4), whereas it cannot approximate those with ReLU activation (Proposition 8)?
Q4:
Is the theoretical rate of the approximation error of Algorithm 2 tight?
Q5:
Does neighbor sampling fail to approximate GCNs when the degree distribution is skewed (Proposition 10)? Does it succeed when the degree distribution is flat (Assumption 6)?
Q6:
Does neighbor sampling work efficiently for real data?

The experimental details are described in the appendices.
Experiments for Q1:
We measured the speed of the exact computation and the neighbor sampling of two-layer GraphSAGE-GCN and two-layer GATs. We initialized the parameters using the i.i.d. standard multivariate normal distribution. The input graph was a clique K_n. We used ten-dimensional vectors from the i.i.d. standard multivariate normal distribution as the node features. We took r^{(1)} = r^{(2)} = 100 samples. For each method and each graph size n, we ran inference several times and measured the average time consumption and the standard deviation. Figure 1 (a) plots the speed of these methods as the number of nodes increases. This shows that neighbor sampling is several orders of magnitude faster than the exact computation when the graph size is extremely large.

Experiments for Q2:
We used the original one-layer GraphSAGE-GCN (with ReLU activation and normalization) and one-layer GraphSAGE-GCN with ReLU activation only. The input graph was a clique K_n, whose features were x_1 = (1, 0)^⊤ and x_i = (0, 1/n)^⊤ (i ≠ 1), and the weight matrix was the identity matrix I. We used r^{(1)} = 5, 30, and 100 as the sample sizes. If a model can be approximated in constant time, the approximation error goes to zero as the sample size increases, even if the graph size grows without bound. The approximation errors of both models are illustrated in Figure 1 (b). The approximation error of the original GraphSAGE-GCN converges to a nonzero constant even as the sample size increases. In contrast, the approximation error without normalization steadily decreases as the sample size increases. This is consistent with Theorem 1 and Proposition 7.

Experiments for Q3:
We examined the approximation errors of the gradients using one-layer GraphSAGE-GCN with ReLU activation and with sigmoid activation. The input graph was a clique K_n with fixed two-dimensional node features and a fixed weight matrix. We used r^{(1)} = 5, 30, and 100 as the sample sizes. Figure 1 (c) illustrates the approximation errors of both models. The approximation error with ReLU activation converges to a nonzero constant, even if the sample size increases. In contrast, the approximation error with sigmoid activation steadily decreases as the sample size increases. This is consistent with Theorem 4 and Proposition 8.

Experiments for Q4:
We used one-layer GraphSAGE-GCN with sigmoid activation. The input graph was a clique K_n with n = 40000 nodes. We fixed the dimension of the intermediate embeddings, and each feature value was set to 1 with probability 0.5 and to −1 otherwise. This satisfies Assumption 1, i.e., ‖x_i‖ is bounded by the square root of the feature dimension. We computed the approximation errors of Algorithm 2 with different numbers of samples. Figure 1 (d) illustrates an upper percentile point of the empirical approximation errors and the theoretical bound by Theorem 2, i.e., ε = O(r^{-1/2}). It shows that the approximation error decreases at the theoretical rate. This indicates that the theoretical rate is tight.
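The ε = O(r^{-1/2}) rate in Figure 1 (d) is the familiar Monte Carlo rate of a sample mean under Hoeffding's inequality. The toy simulation below, with our own choice of ±1 features mimicking the setup of Q4, reproduces the rate for a single mean-aggregation step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40000, 10
x = rng.choice([-1.0, 1.0], size=(n, d))  # bounded features (Assumption 1)
exact = x.mean(axis=0)                    # exact mean aggregation

for r in [10, 100, 1000, 10000]:
    # Uniform sampling with replacement, as in Algorithm 2.
    errs = [np.linalg.norm(x[rng.integers(0, n, size=r)].mean(axis=0) - exact)
            for _ in range(200)]
    # An upper percentile of the error shrinks roughly like r^{-1/2}.
    print(r, round(np.percentile(errs, 90), 4), round(r ** -0.5, 4))
```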
Experiments for Q5: We analyzed the instances in which neighbor sampling succeeds and fails for a variety of models. First, we used the Barabási–Albert model (BA model) (Barabási & Albert, 1999) to generate input graphs. The degree distribution of the BA model follows a power law, which indicates that neighbor sampling will fail to approximate GCNs (Proposition 10, Assumption 6). We used two-layer GraphSAGE-GCN, GATs, and GCNs with ReLU activation. We used ten-dimensional vectors from the i.i.d. standard multivariate normal distribution as the node features. We used the same number r of samples in the first and second layers, i.e., r = r^{(1)} = r^{(2)}, and varied the value of r. We used graphs with n = r nodes. Note that if constant time approximation is possible, the error is bounded as the number of samples increases even if the number of nodes increases. Figure 1 (e) shows that the error of GCNs increases linearly, even if the number of samples increases.
Figure 1. (a) Inference speed of each method. (b) Approximation error of the original GraphSAGE-GCN (i.e., with ReLU activation and normalization) and GraphSAGE-GCN with ReLU activation. (c) Approximation error of the gradient with ReLU and sigmoid activations. (d) Approximation error of Algorithm 2 and its theoretical bound. (e) Approximation error of GraphSAGE-GCN, GATs, and GCNs with the Barabási–Albert model. (f) Approximation error of GraphSAGE-GCN, GATs, GCNs, and GraphSAGE-pool with the Erdős–Rényi model.
However, the errors of GraphSAGE-GCN and GATs gradually decrease as the number of samples increases. This indicates that we cannot bound the approximation error of GCNs no matter how many samples we use. This is consistent with Proposition 10. This result indicates that the approximation of GCNs requires close supervision when the input graph is a social network, because the degree distribution of a social network follows a power law, as in the BA model.

Next, we used the Erdős–Rényi model (ER model) (Erdős & Rényi, 1959), which generates graphs with a flat degree distribution. We used two-layer GraphSAGE-GCN, GATs, GCNs, and GraphSAGE-pool. Figure 1 (f) shows the approximation errors. The errors of GraphSAGE-GCN, GATs, and GCNs gradually decrease as the number of samples increases. This is consistent with Theorem 1 and Assumption 6. In contrast, the approximation error of GraphSAGE-pool does not decrease, even though the input graphs are generated by the ER model. This is consistent with Proposition 9.
Figure 2. Real data: approximation errors (normalized L2) of the embeddings and of the gradients with respect to W^{(1)} and W^{(2)} for Cora, PubMed, and Reddit.
Experiments for Q6:
We used three real-world datasets: Cora, PubMed, and Reddit. They contain 2,708, 19,717, and 232,965 nodes, respectively. We randomly chose nodes for validation and testing and used the remaining nodes for training. We used two-layer GraphSAGE-GCN with sigmoid activation in this experiment. The dimensions of the hidden layers were fixed, and we used an additional fully connected layer to predict the labels of the nodes from the embeddings. We trained the models with Adam (Kingma & Ba, 2015). We first trained ten models with the training nodes for each dataset. The trained models achieved reasonable micro-F1 scores on all three datasets; it should be noted that we did not aim to obtain high classification accuracy in this experiment but intended only to sanity check the models. We used the same number r of samples in the first and second layers to compute the embeddings of the test nodes, i.e., r = r^{(1)} = r^{(2)}, and varied the value of r. We also calculated the gradients of the embedding obtained by Algorithm 2 with respect to the parameter matrices W^{(1)} and W^{(2)}. We compared these values with the exact embeddings and the exact gradients. Figure 2 illustrates an upper percentile point of the empirical approximation errors. We normalized them so that the value was 1 at r = 1 to demonstrate the decreasing rate. The figure shows that the approximation errors of the embeddings and gradients rapidly decrease for the real-world data.
7. Conclusion
We analyzed neighbor sampling to prove that it can approximate the embeddings and gradients of GNNs in constant time, where the complexity is completely independent of the number of the nodes, edges, and neighbors of the input. This is the first analysis that offers constant time approximation for GNNs. We further demonstrated that some existing GNNs cannot be approximated in constant time by any algorithm. Finally, we validated the theory through experiments using synthetic and real-world datasets.
8. Acknowledgments
This work was supported by JSPS KAKENHI Grant Number 15H01704 and the JST PRESTO program JPMJPR165A.
References
Angluin, D. Local and global properties in networks of processors (extended abstract). In Proceedings of the 12th Annual ACM Symposium on Theory of Computing, STOC, pp. 82–93, 1980.

Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

Baskin, I. I., Palyulin, V. A., and Zefirov, N. S. A neural device for searching direct correlations between structures and properties of chemical compounds. Journal of Chemical Information and Computer Sciences, 37(4):715–721, 1997.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In Proceedings of the Second International Conference on Learning Representations, ICLR, 2014.

Chazelle, B., Rubinfeld, R., and Trevisan, L. Approximating the minimum spanning tree weight in sublinear time. SIAM J. Comput., 34(6):1370–1379, 2005.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In Proceedings of the Sixth International Conference on Learning Representations, ICLR, 2018a.

Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 941–949, 2018b.

Chiang, W., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 257–266, 2019.

Czumaj, A. and Sohler, C. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, STOC, pp. 175–183, 2004.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, NIPS, pp. 3837–3845, 2016.

Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, NIPS, pp. 2224–2232, 2015.

Erdős, P. and Rényi, A. On random graphs I. Publicationes Mathematicae, 6:290–297, 1959.

Fan, W., Ma, Y., Li, Q., He, Y., Zhao, Y. E., Tang, J., and Yin, D. Graph neural networks for social recommendation. In The World Wide Web Conference, WWW, pp. 417–426, 2019.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML, pp. 1263–1272, 2017.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 249–256, 2010.

Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in graph domains. In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, pp. 729–734, 2005.

Hamilton, W. L., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, NIPS, pp. 1025–1035, 2017.

Hayashi, K. and Yoshida, Y. Minimizing quadratic functions in constant time. In Advances in Neural Information Processing Systems 29, NIPS, pp. 2217–2225, 2016.

Hayashi, K. and Yoshida, Y. Fitting low-rank tensors in constant time. In Advances in Neural Information Processing Systems 30, NIPS, pp. 2470–2478, 2017.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Huang, W., Zhang, T., Rong, Y., and Huang, J. Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems 31, NeurIPS, 2018.

Indyk, P. A sublinear time approximation scheme for clustering in metric spaces. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS, pp. 154–159, 1999.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations, ICLR, 2015.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the Fifth International Conference on Learning Representations, ICLR, 2017.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. S. Gated graph sequence neural networks. In Proceedings of the Fourth International Conference on Learning Representations, ICLR, 2016.

Merkwirth, C. and Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168, 2005.

Mishra, N., Oblinger, D., and Pitt, L. Sublinear time approximate clustering. In Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, SODA, pp. 439–447, 2001.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, pp. 4602–4609, 2019.

Nguyen, H. N. and Onak, K. Constant-time approximation algorithms via local improvements. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pp. 327–336, 2008.

Park, N., Kan, A., Dong, X. L., Zhao, T., and Faloutsos, C. Estimating node importance in knowledge graphs using graph neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 596–606, 2019.

Parnas, M. and Ron, D. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci., 381(1-3):183–196, 2007.

Rubinfeld, R. and Sudan, M. Robust characterizations of polynomials with applications to program testing. SIAM J. Comput., 25(2):252–271, 1996.

Sato, R., Yamada, M., and Kashima, H. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems 32, NeurIPS, 2019.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.

Schlichtkrull, M. S., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC, pp. 593–607, 2018.

Sperduti, A. and Starita, A. Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks, 8(3):714–735, 1997.

Suomela, J. Survey of local algorithms. ACM Comput. Surv., 45(2):24:1–24:40, 2013.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In Proceedings of the Sixth International Conference on Learning Representations, ICLR, 2018.

Wang, H., Zhao, M., Xie, X., Li, W., and Guo, M. Knowledge graph convolutional networks for recommender systems. In The World Wide Web Conference, WWW, pp. 3307–3313, 2019a.

Wang, X., He, X., Cao, Y., Liu, M., and Chua, T. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 950–958, 2019b.

Weisfeiler, B. and Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In Proceedings of the Seventh International Conference on Learning Representations, ICLR, 2019.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 974–983, 2018.

Yoshida, Y., Yamamoto, M., and Ito, H. An improved constant-time approximation algorithm for maximum matchings. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC, pp. 225–234, 2009.

Zhang, M., Cui, Z., Neumann, M., and Chen, Y. An end-to-end deep learning architecture for graph classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI, pp. 4438–4445, 2018.

Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y., and Gu, Q. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems 32, NeurIPS, pp. 11247–11256, 2019.
A. Models
We introduce GraphSAGE-GCN, GraphSAGE-mean, GraphSAGE-pool, graph convolutional networks (GCNs), and graph attention networks (GATs) for the completeness of our paper.
GraphSAGE-GCN (Hamilton et al., 2017): The message function and the update function of this model are

  M_{lvu}(z_v, z_u, e_{vu}, θ) = z_u / deg(v),
  U_l(z_v, h_v, θ) = σ(W^{(l)} h_v),

where W^{(l)} is a parameter matrix and σ is an activation function such as the sigmoid and ReLU. GraphSAGE-GCN includes the center node itself in the set of adjacent nodes (i.e., N(v) ← N(v) ∪ {v}).

GraphSAGE-mean (Hamilton et al., 2017): The message function and the update function of this model are

  M_{lvu}(z_v, z_u, e_{vu}, θ) = z_u / deg(v),
  U_l(z_v, h_v, θ) = σ(W^{(l)} [z_v, h_v]),

where [·] denotes concatenation.

GraphSAGE-pool (Hamilton et al., 2017): We do not formulate GraphSAGE-pool using the message function and the update function, because it takes the maximum instead of the summation. The model of GraphSAGE-pool is

  z_v^{(l)} = max({σ(W^{(l)} z_u^{(l-1)} + b) | u ∈ N(v)}).

GCNs (Kipf & Welling, 2017): The message function and the update function of this model are

  M_{lvu}(z_v, z_u, e_{vu}, θ) = z_u / sqrt(deg(v) deg(u)),
  U_l(z_v, h_v, θ) = σ(W^{(l)} h_v).

GATs (Veličković et al., 2018): The message function and the update function of this model are

  α_{vu}^{(l)} = exp(LeakyReLU(a^{(l)⊤} [W^{(l)} z_v, W^{(l)} z_u])) / Σ_{u'∈N(v)} exp(LeakyReLU(a^{(l)⊤} [W^{(l)} z_v, W^{(l)} z_{u'}])),
  M_{lvu}(z_v, z_u, e_{vu}, θ) = α_{vu}^{(l)} z_u,
  U_l(z_v, h_v, θ) = σ(W^{(l)} h_v).
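For concreteness, the message and update functions above translate directly into code. The sketch below (our own transcription, with σ fixed to the sigmoid as an arbitrary choice) instantiates GraphSAGE-GCN and GCNs; note how the GCN message also needs deg(u) of the sampled neighbor, which is visible in the formula and is the source of the sensitivity to skewed degree distributions discussed in Section 5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GraphSAGE-GCN: M_lvu = z_u / deg(v), U_l = sigma(W^{(l)} h_v).
def sage_gcn_message(z_u, deg_v):
    return z_u / deg_v

def sage_gcn_update(W_l, h_v):
    return sigmoid(W_l @ h_v)

# GCNs: M_lvu = z_u / sqrt(deg(v) * deg(u)); the update is the same.
def gcn_message(z_u, deg_v, deg_u):
    return z_u / np.sqrt(deg_v * deg_u)
```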
Note that, technically, MPNNs do not include GATs, because GATs use the embeddings of other neighboring nodes to calculate the attention value α_{vu}. However, we can apply the same argument as for MPNNs to GATs, and neighbor sampling can approximate GATs in constant time as it does other MPNNs. To be precise, the following proposition holds true.

Proposition 11. Suppose Assumption 1 holds true and σ is Lipschitz continuous. Take r^{(L)} = O(ε^{-2} log(1/δ)) and r^{(1)}, ..., r^{(L-1)} = O(ε^{-2} (log(1/ε) + log(1/δ))) samples, and let

  ẑ_v^{(0)} = z_v^{(0)},
  α̂_{vu}^{(l)} = exp(LeakyReLU(a^{(l)⊤} [W^{(l)} ẑ_v^{(l-1)}, W^{(l)} ẑ_u^{(l-1)}])) / Σ_{u'∈S^{(l)}} exp(LeakyReLU(a^{(l)⊤} [W^{(l)} ẑ_v^{(l-1)}, W^{(l)} ẑ_{u'}^{(l-1)}])),
  ĥ_v^{(l)} = Σ_{u∈S^{(l)}} α̂_{vu}^{(l)} ẑ_u^{(l-1)},
  ẑ_v^{(l)} = U_l(ẑ_v^{(l-1)}, ĥ_v^{(l)}, θ).

Then, the following property holds true: Pr[‖z_v^{(L)} − ẑ_v^{(L)}‖ ≥ ε] < δ.

We prove Proposition 11 in Section C.
B. Experimental Setup
Experiments for Q1:
The speed was evaluated on an Intel Xeon CPU E5-2690.
Experiments for Q4:
We initialized the weight matrix W^{(1)} with the normal distribution and then normalized it so that the operator norm ‖W^{(1)}‖_op of the matrix was equal to 1. This satisfies Assumption 1 (i.e., ‖θ‖ is bounded). For each value of r, we (1) initialized the weight matrix, (2) chose random nodes, (3) calculated the exact embedding of each chosen node, (4) calculated the estimate for each chosen node with r samples, i.e., r^{(1)} = r, and (5) calculated the approximation error of each chosen node.
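For reference, the normalization step described above can be carried out with a spectral norm computation; a minimal sketch of one way to do it (numpy's ord=2 matrix norm is the operator norm):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))
W /= np.linalg.norm(W, ord=2)  # rescale so that ||W||_op = 1
assert np.isclose(np.linalg.norm(W, ord=2), 1.0)
```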
Experiments for Q5: In the experiment for the BA model, we (1) iterated over the values of r, (2) set n = r, (3) generated 10 graphs with n nodes using the BA model, (4) chose the node with the maximum degree in each generated graph, (5) calculated the exact embedding and its estimate for each chosen node with r samples, i.e., r^{(1)} = r^{(2)} = r, and (6) calculated the approximation error.

The experimental process for the ER model was similar to that for the BA model, but we used the ER model instead of the BA model and set n to a slower-growing function of r instead of n = r to reduce the computational cost.

Experiments for Q6:
To calculate the approximation errors of the embeddings, for each trained model and for each value of r, we (1) calculated the exact embedding of each test node, (2) calculated an estimate of the embedding of each test node with r samples, i.e., r^{(1)} = r^{(2)} = r, and (3) calculated the approximation error of each test node. To calculate the approximation errors of the gradients, for each dataset, we (1) initialized ten models with the Xavier initializer (Glorot & Bengio, 2010), (2) chose random training nodes, and (3) for each model, each chosen node, and each value of r, calculated the exact and approximate gradients of the classification loss with respect to the parameters and calculated their approximation error.

C. Proofs
We introduce the following multivariate version of Hoeffding's inequality (Hoeffding, 1963) to prove the theoretical bounds (Theorems 1 and 4).
Lemma 12 (multivariate Hoeffding's inequality). Let x_1, x_2, ..., x_n be independent d-dimensional random variables whose two-norms are bounded as ‖x_i‖ ≤ B, and let x̄ be the empirical mean of these variables, x̄ = (1/n) Σ_{i=1}^n x_i. Then, for any ε > 0,

  Pr[‖x̄ − E[x̄]‖ ≥ ε] ≤ 2d exp(−nε² / (2B²d))

holds true. Lemma 12 states that the empirical mean of n = O(ε^{-2} log(1/δ)) samples independently drawn from the same distribution approximates the exact mean with an absolute error of at most ε and probability 1 − δ.

Lemma 13 (Hoeffding's inequality (Hoeffding, 1963)). Let X_1, X_2, ..., X_n be independent random variables bounded by the interval [−B, B], and let X̄ be the empirical mean of these variables, X̄ = (1/n) Σ_{i=1}^n X_i. Then, for any ε > 0,

  Pr[|X̄ − E[X̄]| ≥ ε] ≤ 2 exp(−nε² / (2B²))

holds true.

Proof of Lemma 12. Apply Lemma 13 to each dimension k of x_i. Then,

  Pr[|x̄_k − E[x̄]_k| ≥ ε/√d] ≤ 2 exp(−nε² / (2B²d)).

It should be noted that |x_{ik}| ≤ B because ‖x_i‖ ≤ B. Therefore,

  Pr[∃k ∈ {1, 2, ..., d}: |x̄_k − E[x̄]_k| ≥ ε/√d] ≤ 2d exp(−nε² / (2B²d)).

If |x̄_k − E[x̄]_k| < ε/√d holds true for every dimension k, then

  ‖x̄ − E[x̄]‖ = sqrt(Σ_{k=1}^d (x̄_k − E[x̄]_k)²) < sqrt(d · ε²/d) = ε.

Therefore, Pr[‖x̄ − E[x̄]‖ ≥ ε] ≤ 2d exp(−nε² / (2B²d)). ∎
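A quick numerical check of Lemma 12, with a bounded distribution of our own choosing, compares the empirical deviation frequency against the stated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, n, trials, eps = 5, 1.0, 10000, 1000, 0.05
deviations = []
for _ in range(trials):
    # x_i uniform on [-B/sqrt(d), B/sqrt(d)]^d, so ||x_i|| <= B and E[x] = 0.
    x = rng.uniform(-B / np.sqrt(d), B / np.sqrt(d), size=(n, d))
    deviations.append(np.linalg.norm(x.mean(axis=0)))
empirical = np.mean(np.array(deviations) >= eps)
bound = 2 * d * np.exp(-n * eps**2 / (2 * B**2 * d))  # Lemma 12
print(empirical, "<=", bound)
```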
First, we prove that the embedding in each layer is bounded.

Lemma 14.
Under Assumptions 1 and 2, the norms of the embeddings ‖z_v^{(l)}‖, ‖ẑ_v^{(l)}‖, ‖h_v^{(l)}‖, and ‖ĥ_v^{(l)}‖ (l = 1, ..., L) are bounded by a constant B₂ ∈ R.

Proof of Lemma 14. We prove the lemma by mathematical induction. The norm of the input to the first layer is bounded by Assumption 1. The message function deg(i) M_{liu} and the update function U_l are continuous by Assumption 2. Since the image f(X) of a compact set X ⊂ R^d is compact if f is continuous, the images of deg(i) M_{liu} and U_l are bounded by induction. ∎

Proof of Theorem 1.
We prove the theorem by mathematical induction on the number of layers L.

Base case: It is shown that the statement holds true for L = 1.

Because U_L is uniformly continuous, ∃ε' > 0 such that ∀z_v, h_v, h'_v, θ:

  ‖h_v − h'_v‖ < ε' ⇒ ‖U_L(z_v, h_v, θ) − U_L(z_v, h'_v, θ)‖ < ε.   (1)

Let x_k be the k-th sample in S^{(L)} and X_k = deg(v) M_{Lvx_k}(z_v^{(0)}, z_{x_k}^{(0)}, e_{vx_k}, θ). Then,

  E[X_k] = Σ_{u∈N(v)} M_{Lvu}(z_v^{(0)}, z_u^{(0)}, e_{vu}, θ) = h_v^{(L)}.   (2)

There exists a constant C ∈ R such that for any input satisfying Assumption 1,

  ‖X_k‖ < C   (3)

holds true, because ‖z_v^{(0)}‖, ‖z_{x_k}^{(0)}‖, ‖e_{vx_k}‖, and ‖θ‖ are bounded by Assumption 1 and deg(v) M_{Lvu} is continuous. Therefore, if we take r^{(L)} = O(ε'^{-2} log(1/δ)) samples, Pr[‖ĥ_v^{(L)} − h_v^{(L)}‖ ≥ ε'] ≤ δ by Hoeffding's inequality and equations (2) and (3). Therefore, Pr[‖ẑ_v^{(L)} − z_v^{(L)}‖ ≥ ε] ≤ δ.

Inductive step:
It is shown that the statement holds true for L = l + 1 if it holds true for L = l. The induction hypothesis is: ∀ε > 0, 1 > δ > 0, ∃r^{(1)}(ε, δ), ..., r^{(L-1)}(ε, δ) such that ∀v ∈ V, Pr[‖Ô_z^{(L-1)}(v, ε, δ) − z_v^{(L-1)}‖ ≥ ε] ≤ δ.

Because U_L is uniformly continuous, ∃ε' > 0 such that ∀z_v, h_v, z'_v, h'_v, θ:

  ‖[z_v, h_v] − [z'_v, h'_v]‖ < ε' ⇒ ‖U_L(z_v, h_v, θ) − U_L(z'_v, h'_v, θ)‖ < ε,   (4)

where [·] denotes concatenation. By the induction hypothesis, ∃r'^{(1)}, ..., r'^{(L-1)} such that

  Pr[‖Ô_z^{(L-1)}(v) − z_v^{(L-1)}‖ ≥ ε'/√2] ≤ δ/2   (5)

holds true. Let

  h̃_v^{(L)} = (deg(v)/r^{(L)}) Σ_{u∈S^{(L)}} M_{Lvu}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, θ).

Let x_k be the k-th sample in S^{(L)} and X_k = deg(v) M_{Lvx_k}(z_v^{(L-1)}, z_{x_k}^{(L-1)}, e_{vx_k}, θ). Then,

  E[X_k] = Σ_{u∈N(v)} M_{Lvu}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, θ) = h_v^{(L)}.   (6)

There exists a constant C ∈ R such that for any input satisfying Assumption 1,

  ‖X_k‖ < C,   (7)

because ‖z_v^{(L-1)}‖, ‖z_{x_k}^{(L-1)}‖, ‖e_{vx_k}‖, and ‖θ‖ are bounded by Assumption 1 and Lemma 14, and deg(v) M_{Lvu} is continuous. If we take r^{(L)} = O(ε'^{-2} log(1/δ)), then

  Pr[‖h̃_v^{(L)} − h_v^{(L)}‖ ≥ ε'/(2√2)] ≤ δ/4,   (8)

by Hoeffding's inequality and equations (6) and (7). Because deg(v) M_{Lvu} is uniformly continuous, ∃ε'' > 0 such that

  ‖[z_v^{(L-1)}, z_u^{(L-1)}] − [z'_v^{(L-1)}, z'_u^{(L-1)}]‖ ≤ ε'' ⇒ deg(v) ‖M_{Lvu}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, θ) − M_{Lvu}(z'_v^{(L-1)}, z'_u^{(L-1)}, e_{vu}, θ)‖ ≤ ε'/(2√2).   (9)

By the induction hypothesis, ∃r''^{(1)}, ..., r''^{(L-1)} such that

  Pr[‖Ô_z^{(L-1)}(u) − z_u^{(L-1)}‖ ≥ ε''/√2] ≤ δ/(8 r^{(L)}).   (10)

Therefore, the probability that the errors of all oracle calls are bounded satisfies

  Pr[∃u ∈ S^{(L)}: ‖[Ô_z^{(L-1)}(v), Ô_z^{(L-1)}(u)] − [z_v^{(L-1)}, z_u^{(L-1)}]‖ ≥ ε''] ≤ δ/4.   (11)

By equations (9) and (11),

  Pr[∃u ∈ S^{(L)}: deg(v) ‖M_{Lvu}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, θ) − M_{Lvu}(ẑ_v^{(L-1)}, ẑ_u^{(L-1)}, e_{vu}, θ)‖ ≥ ε'/(2√2)] ≤ δ/4.

Hence,

  Pr[‖h̃_v^{(L)} − ĥ_v^{(L)}‖ ≥ ε'/(2√2)] ≤ δ/4.   (12)

By the triangle inequality and equations (8) and (12),

  Pr[‖ĥ_v^{(L)} − h_v^{(L)}‖ ≥ ε'/√2] ≤ δ/2.   (13)

Therefore, if we take r^{(1)} = max(r'^{(1)}, r''^{(1)}), ..., r^{(L-1)} = max(r'^{(L-1)}, r''^{(L-1)}), then by equations (5) and (13),

  Pr[‖[z_v^{(L-1)}, h_v^{(L)}] − [Ô_z^{(L-1)}(v), ĥ_v^{(L)}]‖ ≥ ε'] ≤ δ.   (14)

Therefore, by equations (4) and (14), Pr[‖ẑ_v^{(L)} − z_v^{(L)}‖ ≥ ε] ≤ δ. ∎

Proof of Theorem 2.
Proof of Theorem 2. We prove this by induction on the number of layers.

Base case: We show that the statement holds true for $L = 1$. If $U_L$ is $K$-Lipschitz continuous, then $\varepsilon' = O(\varepsilon)$ in equation (1). Therefore, $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ is sufficient.

Inductive step:
We show that the statement holds true for $L = l + 1$ if it holds true for $L = l$. If $U_L$ and $M_L$ are $K$-Lipschitz continuous, then $\varepsilon' = O(\varepsilon)$ in equation (4) and $\varepsilon'' = O(\varepsilon)$ in equation (9). Therefore, $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$. We call $\hat{O}_z^{(L-1)}(v)$ such that $\Pr[\|\hat{O}_z^{(L-1)}(v) - z_v^{(L-1)}\| \ge \varepsilon'/\sqrt{2}] \le \delta/2$ in equation (5); therefore, $r'^{(1)}, \ldots, r'^{(L-1)} = O(\varepsilon^{-2}(\log \varepsilon^{-1} + \log \delta^{-1}))$ are sufficient by the induction hypothesis. We call $\hat{O}_z^{(L-1)}(u)$ such that $\Pr[\|\hat{O}_z^{(L-1)}(u) - z_u^{(L-1)}\| \ge \varepsilon''/\sqrt{2}] \le \delta/(8 r^{(L)})$ in equation (10); therefore, $r''^{(1)}, \ldots, r''^{(L-1)} = O(\varepsilon^{-2}(\log \varepsilon^{-1} + \log \delta^{-1}))$ are sufficient by the induction hypothesis, because $\log(8 r^{(L)}/\delta) = O(\log \varepsilon^{-1} + \log \delta^{-1})$. In total, the complexity is $O(\varepsilon^{-2L} (\log \varepsilon^{-1} + \log \delta^{-1})^{L-1} \log \delta^{-1})$.

Lemma 15 ((Chazelle et al., 2005)). Let $D_s$ be $\mathrm{Bernoulli}((1 + s\varepsilon)/2)$. Let the $n$-dimensional distribution $D$ be defined as follows: (1) pick $s = 1$ with probability $1/2$ and $s = -1$ otherwise; (2) then draw $n$ values from $D_s$. Any probabilistic algorithm that can guess the value of $s$ with a probability of error below $1/4$ requires $\Omega(\varepsilon^{-2})$ bit lookups on average.

Proof of Theorem 3. We show that there is a counterexample in the GraphSAGE-GCN models. Suppose there is an algorithm that approximates the one-layer GraphSAGE-GCN within $o(\varepsilon^{-2})$ queries. We show that this algorithm can distinguish $D$ in Lemma 15 within $o(\varepsilon^{-2})$ queries and derive a contradiction. Let $\sigma$ be any non-constant $K$-Lipschitz activation function. There exist $a, b \in \mathbb{R}$ ($a > b$) such that $\sigma(a) \ne \sigma(b)$ because $\sigma$ is not constant. Let $S = |\sigma(a) - \sigma(b)|/(a - b) > 0$. Let $\varepsilon > 0$ be any sufficiently small positive value and $t \in \{0, 1\}^n$ be a random variable drawn from $D$. We show that we can determine $s$ with high probability within $o(\varepsilon^{-2})$ queries using the algorithm. Let $G$ be the clique $K_n$ and $W^{(1)} = 1$. We calculate $a_\varepsilon$ and $b_\varepsilon$ using the following steps: (1) set $a_\varepsilon = a$ and $b_\varepsilon = b$; (2) if $a_\varepsilon - b_\varepsilon < \varepsilon$, return $a_\varepsilon$ and $b_\varepsilon$; (3) set $m = (a_\varepsilon + b_\varepsilon)/2$; (4) if $|\sigma(a_\varepsilon) - \sigma(m)| > |\sigma(m) - \sigma(b_\varepsilon)|$, set $b_\varepsilon = m$, otherwise set $a_\varepsilon = m$; and (5) go back to (2). Here, $\varepsilon/2 \le a_\varepsilon - b_\varepsilon < \varepsilon$, $b \le b_\varepsilon < a_\varepsilon \le a$, and $|\sigma(a_\varepsilon) - \sigma(b_\varepsilon)| \ge S (a_\varepsilon - b_\varepsilon) \ge S\varepsilon/2$ hold true. Let

$$x_v = \frac{a_\varepsilon + b_\varepsilon}{2} + (2 t_v - 1)\, \frac{a_\varepsilon - b_\varepsilon}{2 \varepsilon}$$

for all $v \in V$. Then, $\mathbb{E}[h_v \mid s = 1] = a_\varepsilon$ and $\mathbb{E}[h_v \mid s = -1] = b_\varepsilon$. Therefore, $\Pr[|z_v - \sigma(a_\varepsilon)| < S\varepsilon/8 \mid s = 1] \to 1$ as $n \to \infty$ and $\Pr[|z_v - \sigma(b_\varepsilon)| < S\varepsilon/8 \mid s = -1] \to 1$ as $n \to \infty$ because $\sigma$ is $K$-Lipschitz. We set the error tolerance to $S\varepsilon/8$ and $n$ to a sufficiently large number. Then, we output $s = 1$ if $|\hat{z}_v - \sigma(a_\varepsilon)| < S\varepsilon/4$ and $s = -1$ otherwise; this is correct with high probability because $|\sigma(a_\varepsilon) - \sigma(b_\varepsilon)| \ge S\varepsilon/2$. However, the algorithm accesses $t$ (i.e., accesses $O_{\mathrm{feature}}$) only $o(\varepsilon^{-2})$ times. This contradicts Lemma 15.
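The interval-shrinking steps (1)–(5) in the proof can be read as the following bisection; `sigma` is any non-constant Lipschitz activation, and the code is an illustrative sketch rather than part of the paper's algorithm.

```python
def shrink_interval(sigma, a, b, eps):
    """Bisection that keeps the half on which sigma varies more.

    By the triangle inequality, the invariant
    |sigma(a_eps) - sigma(b_eps)| >= S * (a_eps - b_eps) is preserved,
    where S is the mean slope of sigma on the initial interval [b, a].
    """
    a_eps, b_eps = a, b
    while a_eps - b_eps >= eps:
        m = (a_eps + b_eps) / 2.0
        if abs(sigma(a_eps) - sigma(m)) > abs(sigma(m) - sigma(b_eps)):
            b_eps = m  # the upper half carries the larger sigma-gap
        else:
            a_eps = m  # the lower half carries the larger sigma-gap
    return a_eps, b_eps  # eps/2 <= a_eps - b_eps < eps

# e.g. shrink_interval(lambda t: max(0.0, t), 1.0, -1.0, 0.1) returns an
# interval of width in [0.05, 0.1) with a sigma-gap of at least S*eps/2.
```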
Lemma 16. Under Assumptions 1, 2, and 3, the norms of the gradients of the update and message functions, $\|DU_l(z_v^{(l-1)}, h_v^{(l)}, \theta)\|_F$, $\|DU_l(\hat{z}_v^{(l-1)}, \hat{h}_v^{(l)}, \theta)\|_F$, $\|\deg(v)\, DM_l(z_v^{(l-1)}, z_u^{(l-1)}, e_{vu}, \theta)\|_F$, and $\|\deg(v)\, DM_l(\hat{z}_v^{(l-1)}, \hat{z}_u^{(l-1)}, e_{vu}, \theta)\|_F$, are bounded by a constant $B' \in \mathbb{R}$.
Proof of Lemma 16. The input of each function is bounded by Lemma 14. Because $DU_l$ and $\deg(v)\, DM_l$ are uniformly continuous and their domains are bounded, their images are bounded.

Proof of Theorem 4.
We prove the theorem by induction on the number of layers $L$.

Base case: We show that the statement holds true for $L = 1$. When the number of layers is one,

$$\frac{\partial z_v^{(L)}}{\partial \theta} = \frac{\partial U_L}{\partial \theta}(z_v^{(0)}, h_v^{(L)}, \theta) + \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta)\, \frac{\partial h_v^{(L)}}{\partial \theta} = \frac{\partial U_L}{\partial \theta}(z_v^{(0)}, h_v^{(L)}, \theta) + \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta) \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial \theta}(z_v^{(0)}, z_u^{(0)}, e_{vu}, \theta),$$

$$\frac{\partial \hat{z}_v^{(L)}}{\partial \theta} = \frac{\partial U_L}{\partial \theta}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta) + \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\, \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} = \frac{\partial U_L}{\partial \theta}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta) + \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\, \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \theta}(z_v^{(0)}, z_u^{(0)}, e_{vu}, \theta).$$

Because $DU_L$ is uniformly continuous, there exists $\varepsilon' > 0$ such that, for all inputs,

$$\|h_v^{(L)} - \hat{h}_v^{(L)}\| < \varepsilon' \Rightarrow \left( \left\|\frac{\partial U_L}{\partial \theta}(z_v^{(0)}, h_v^{(L)}, \theta) - \frac{\partial U_L}{\partial \theta}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\right\|_F < \frac{\varepsilon}{2} \ \wedge\ \left\|\frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta) - \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\right\|_F < \frac{\varepsilon}{4 B'} \right). \quad (15)$$

If we take $r^{(L)} = O(\varepsilon'^{-2} \log \delta^{-1})$,

$$\Pr[\|h_v^{(L)} - \hat{h}_v^{(L)}\| \ge \varepsilon'] \le \delta/2 \quad (16)$$

holds for any input by the argument in the proof of Theorem 1. Let $x_k$ be the $k$-th sample in $S^{(L)}$ and $X_k = \deg(v)\, \frac{\partial M_L}{\partial \theta}(z_v^{(0)}, z_{x_k}^{(0)}, e_{v x_k}, \theta)$. Then,

$$\mathbb{E}[X_k] = \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial \theta}(z_v^{(0)}, z_u^{(0)}, e_{vu}, \theta) = \frac{\partial h_v^{(L)}}{\partial \theta}. \quad (17)$$

There exists a constant $C \in \mathbb{R}$ such that, for any input satisfying Assumption 1,

$$\|X_k\| < C, \quad (18)$$

because $\|z_v^{(0)}\|$, $\|z_{x_k}^{(0)}\|$, $\|e_{v x_k}\|$, and $\|\theta\|$ are bounded by Assumption 1 and $\deg(v)\, DM_L$ is continuous. Therefore, if we take $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ samples,

$$\Pr\!\left[\left\|\frac{\partial \hat{h}_v^{(L)}}{\partial \theta} - \frac{\partial h_v^{(L)}}{\partial \theta}\right\|_F \ge \frac{\varepsilon}{4 B'}\right] \le \delta/2 \quad (19)$$

holds by Hoeffding's inequality and equations (17) and (18). If $\|\partial \hat{h}_v^{(L)}/\partial \theta - \partial h_v^{(L)}/\partial \theta\|_F < \varepsilon/(4 B')$ and $\|\partial U_L/\partial h_v^{(L)}(z_v^{(0)}, h_v^{(L)}, \theta) - \partial U_L/\partial \hat{h}_v^{(L)}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\|_F < \varepsilon/(4 B')$ hold true, then

$$\begin{aligned}
&\left\| \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\, \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} - \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta)\, \frac{\partial h_v^{(L)}}{\partial \theta} \right\|_F \\
&\le \left\| \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta)\, \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} - \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta)\, \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} \right\|_F + \left\| \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta)\, \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} - \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta)\, \frac{\partial h_v^{(L)}}{\partial \theta} \right\|_F \\
&\le \left\| \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(z_v^{(0)}, \hat{h}_v^{(L)}, \theta) - \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta) \right\|_F \left\| \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} \right\|_F + \left\| \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(0)}, h_v^{(L)}, \theta) \right\|_F \left\| \frac{\partial \hat{h}_v^{(L)}}{\partial \theta} - \frac{\partial h_v^{(L)}}{\partial \theta} \right\|_F \\
&\le B' \cdot \frac{\varepsilon}{4 B'} + B' \cdot \frac{\varepsilon}{4 B'} = \frac{\varepsilon}{2},
\end{aligned}$$

where $\|\partial \hat{h}_v^{(L)}/\partial \theta\|_F \le B'$ and $\|\partial U_L/\partial h_v^{(L)}\|_F \le B'$ by Lemma 16. Therefore, $\Pr[\|\partial \hat{z}_v^{(L)}/\partial \theta - \partial z_v^{(L)}/\partial \theta\|_F \ge \varepsilon] \le \delta$ by equations (15), (16), and (19).
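The sampled gradient of the aggregation in this base case is the same Monte Carlo mean applied to $\partial M_L/\partial \theta$; below is a sketch with a hypothetical `grad_message_theta` standing in for that partial derivative.

```python
def estimate_grad_h_theta(v, adj, x, grad_message_theta, theta, r, rng):
    """Unbiased estimate of dh_v/dtheta by uniform neighbor sampling.

    Each term deg(v) * dM/dtheta has expectation dh_v/dtheta (eq. (17)),
    and its norm is bounded (eq. (18)), so Hoeffding applies entrywise.
    """
    nbrs = adj[v]
    deg = len(nbrs)
    sampled = rng.choice(nbrs, size=r, replace=True)
    return (deg / r) * sum(grad_message_theta(x[v], x[u], theta)
                           for u in sampled)
```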
Inductive step: We show that the statement holds true for $L = l + 1$ if it holds true for $L = l$. By the chain rule,

$$\begin{aligned}
\frac{\partial z_v^{(L)}}{\partial \theta} &= \frac{\partial U_L}{\partial \theta}(z_v^{(L-1)}, h_v^{(L)}, \theta) + \frac{\partial U_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, h_v^{(L)}, \theta)\, \frac{\partial z_v^{(L-1)}}{\partial \theta} \\
&\quad + \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(L-1)}, h_v^{(L)}, \theta) \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial \theta}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) \\
&\quad + \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(L-1)}, h_v^{(L)}, \theta) \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_v^{(L-1)}}{\partial \theta} \\
&\quad + \frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(L-1)}, h_v^{(L)}, \theta) \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_u^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_u^{(L-1)}}{\partial \theta},
\end{aligned}$$

$$\begin{aligned}
\frac{\partial \hat{z}_v^{(L)}}{\partial \theta} &= \frac{\partial U_L}{\partial \theta}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta) + \frac{\partial U_L}{\partial \hat{z}_v^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\, \frac{\partial \hat{z}_v^{(L-1)}}{\partial \theta} \\
&\quad + \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\, \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \theta}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta) \\
&\quad + \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\, \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \hat{z}_v^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial \hat{z}_v^{(L-1)}}{\partial \theta} \\
&\quad + \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\, \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \hat{z}_u^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial \hat{z}_u^{(L-1)}}{\partial \theta}.
\end{aligned}$$

Because $DU_L$ is uniformly continuous, there exists $\varepsilon' > 0$ such that, for all inputs,

$$\|[z_v^{(L-1)}, h_v^{(L)}] - [\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}]\| < \varepsilon' \Rightarrow \left( \left\|\frac{\partial U_L}{\partial \theta}(z_v^{(L-1)}, h_v^{(L)}, \theta) - \frac{\partial U_L}{\partial \theta}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\right\|_F < O(\varepsilon) \ \wedge\ \left\|\frac{\partial U_L}{\partial h_v^{(L)}}(z_v^{(L-1)}, h_v^{(L)}, \theta) - \frac{\partial U_L}{\partial \hat{h}_v^{(L)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\right\|_F < O(\varepsilon) \ \wedge\ \left\|\frac{\partial U_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, h_v^{(L)}, \theta) - \frac{\partial U_L}{\partial \hat{z}_v^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{h}_v^{(L)}, \theta)\right\|_F < O(\varepsilon) \right). \quad (20)$$

Because $\deg(v)\, DM_L$ is uniformly continuous, there exists $\varepsilon'' > 0$ such that, for all inputs,

$$\|[z_v^{(L-1)}, z_u^{(L-1)}] - [\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}]\| < \varepsilon'' \Rightarrow \left( \deg(v) \left\|\frac{\partial M_L}{\partial \theta}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) - \frac{\partial M_L}{\partial \theta}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\right\|_F < O(\varepsilon) \ \wedge\ \deg(v) \left\|\frac{\partial M_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) - \frac{\partial M_L}{\partial \hat{z}_v^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\right\|_F < O(\varepsilon) \ \wedge\ \deg(v) \left\|\frac{\partial M_L}{\partial z_u^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) - \frac{\partial M_L}{\partial \hat{z}_u^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\right\|_F < O(\varepsilon) \right). \quad (21)$$

By the argument in the proof of Theorem 1, if we take a sufficiently large number of samples,

$$\Pr[\|z_v^{(L-1)} - \hat{z}_v^{(L-1)}\| \ge O(\min(\varepsilon', \varepsilon''))] \le O(\varepsilon^2 \delta), \quad (22)$$

$$\Pr[\|h_v^{(L)} - \hat{h}_v^{(L)}\| \ge O(\varepsilon')] \le O(\delta). \quad (23)$$

By the induction hypothesis, there exist $r^{(1)}, \ldots, r^{(L-1)}$ such that

$$\Pr\!\left[\left\|\frac{\partial z_u^{(L-1)}}{\partial \theta} - \frac{\partial \hat{z}_u^{(L-1)}}{\partial \theta}\right\|_F \ge O(\varepsilon)\right] \le O(\varepsilon^2 \delta), \qquad \Pr\!\left[\left\|\frac{\partial z_v^{(L-1)}}{\partial \theta} - \frac{\partial \hat{z}_v^{(L-1)}}{\partial \theta}\right\|_F \ge O(\varepsilon)\right] \le O(\varepsilon^2 \delta).$$

If we take $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ samples,

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial \theta}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \theta}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) \right\| \ge O(\varepsilon)\right] \le O(\delta), \quad (24)$$

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_v^{(L-1)}}{\partial \theta} - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_v^{(L-1)}}{\partial \theta} \right\| \ge O(\varepsilon)\right] \le O(\delta), \quad (25)$$

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_u^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_u^{(L-1)}}{\partial \theta} - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial z_u^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_u^{(L-1)}}{\partial \theta} \right\| \ge O(\varepsilon)\right] \le O(\delta) \quad (26)$$

hold true by Hoeffding's inequality.
Therefore,

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial \theta}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta) - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \theta}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta) \right\| \ge O(\varepsilon)\right] \le O(\delta), \quad (27)$$

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_v^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_v^{(L-1)}}{\partial \theta} - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \hat{z}_v^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial \hat{z}_v^{(L-1)}}{\partial \theta} \right\| \ge O(\varepsilon)\right] \le O(\delta), \quad (28)$$

$$\Pr\!\left[\left\| \sum_{u \in \mathcal{N}(v)} \frac{\partial M_L}{\partial z_u^{(L-1)}}(z_v^{(L-1)}, z_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial z_u^{(L-1)}}{\partial \theta} - \frac{\deg(v)}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{\partial M_L}{\partial \hat{z}_u^{(L-1)}}(\hat{z}_v^{(L-1)}, \hat{z}_u^{(L-1)}, e_{vu}, \theta)\, \frac{\partial \hat{z}_u^{(L-1)}}{\partial \theta} \right\| \ge O(\varepsilon)\right] \le O(\delta) \quad (29)$$

hold true by equations (20), (21), (22), (23), (24), (25), and (26). Therefore, if we take $r^{(1)}, \ldots, r^{(L)}$ sufficiently large, $\Pr[\|\partial \hat{z}_v^{(L)}/\partial \theta - \partial z_v^{(L)}/\partial \theta\|_F \ge \varepsilon] \le \delta$ holds true by equations (20), (21), (22), (23), (27), (28), and (29).

Proof of Theorem 5.
We prove this by induction on the number of layers.

Base case: We show that the statement holds true for $L = 1$. If $DU_L$ is $K'$-Lipschitz continuous, then $\varepsilon' = O(\varepsilon)$ in equation (15). Therefore, $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ is sufficient.

Inductive step: We show that the statement holds true for $L = l + 1$ if it holds true for $L = l$. If $DU_L$ and $\deg(v)\, DM_L$ are $K'$-Lipschitz continuous, then $\varepsilon' = O(\varepsilon)$ in equation (20) and $\varepsilon'' = O(\varepsilon)$ in equation (21). Therefore, $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ and $r^{(1)}, \ldots, r^{(L-1)} = O(\varepsilon^{-2}(\log \varepsilon^{-1} + \log \delta^{-1}))$ are sufficient.
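To see how the constants compound across layers, the following helper evaluates the sample-size recursion of Theorems 2 and 5 numerically; the constants hidden in the O-notation are set to 1 here, so the numbers are purely illustrative.

```python
import math

def sample_sizes(L, eps, delta):
    """Per-layer sample sizes following r = eps^-2 * log(1/delta), with
    inner layers run at confidence delta / (8 r) as in the proofs."""
    sizes = []
    d = delta
    for _ in range(L):
        r = math.ceil(eps ** -2 * math.log(1.0 / d))
        sizes.append(r)
        d = d / (8 * r)  # inner oracle calls need higher confidence
    # Queries multiply across nested layers; the product is independent
    # of the number of nodes, edges, and neighbors of the input.
    return sizes, math.prod(sizes)
```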
Proof of Proposition 6. We show that a one-layer GraphSAGE-GCN whose activation function is not constant cannot be approximated in constant time if $\|x_v\|$ or $\|\theta\|$ is not bounded. There exists $a \in \mathbb{R}$ such that $\sigma(a) \ne \sigma(0)$ because $\sigma$ is not constant. We consider the following two types of inputs:

• $G$ is the clique $K_n$, $W^{(1)} = 1$, and $x_i = 0$ for all nodes $i \in V$.
• $G$ is the clique $K_n$, $W^{(1)} = 1$, $x_i = 0$ ($i \ne v$) for some $v \in V$, and $x_v = an$.

Then, for the former type of inputs, $z_u^{(1)} = \sigma(0)$ for every node $u$. For the latter type of inputs, $z_u^{(1)} = \sigma(a)$ for every node $u \ne v$. Let $A$ be an arbitrary constant-time algorithm and $C$ be the number of queries $A$ makes when we set $\varepsilon = |\sigma(a) - \sigma(0)|/3$. When $A$ calculates the embedding of $u \ne v \in V$, the states of all nodes but $u$ are symmetric until $A$ makes a query about that node. Therefore, if $n$ is sufficiently large, $A$ does not make any query about $v$ with high probability (i.e., with probability at least $(1 - n^{-1})^C$). If $A$ does not make any query about $v$, the state of $A$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality, and vice versa. Therefore, $A$ fails to approximate the embeddings of one of the two types of inputs within absolute error $\varepsilon$. As for $\theta$, we set $W^{(1)} = an$ and $x_v = 1$ for the second type of inputs; then the same argument follows.
Proof of Proposition 7. We consider the one-layer GraphSAGE-GCN with ReLU and normalization (i.e., $\sigma(x) = \mathrm{ReLU}(x)/\|\mathrm{ReLU}(x)\|$). We use the following two types of inputs:

• $G$ is the clique $K_n$, $W^{(1)}$ is the identity matrix $I$, $x_i = (0, 0)^\top$ ($i \ne v$) for some node $v \in V$, and $x_v = (1, 0)^\top$.
• $G$ is the clique $K_n$, $W^{(1)}$ is the identity matrix $I$, $x_i = (0, 0)^\top$ ($i \ne v$) for some node $v \in V$, and $x_v = (0, 1)^\top$.

Then, for the former type of inputs, $h_i = (1/n, 0)^\top$, $z_i = (1, 0)^\top$, and $\partial (z_i)_2/\partial W_{21} = 1$ for all $i \in V$. For the latter type of inputs, $h_i = (0, 1/n)^\top$, $z_i = (0, 1)^\top$, and $\partial (z_i)_2/\partial W_{21} = 0$ for all $i \in V$. Let $A$ be an arbitrary constant-time algorithm and $C$ be the number of queries $A$ makes when we set $\varepsilon = 1/3$. When $A$ calculates the embedding or the gradient of $u \ne v \in V$, the states of all nodes but $u$ are symmetric until $A$ makes a query about that node. Therefore, if $n$ is sufficiently large, $A$ does not make any query about $v$ with high probability (i.e., with probability at least $(1 - n^{-1})^C$). If $A$ does not make any query about $v$, the state of $A$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality, and vice versa. Therefore, $A$ fails to approximate the embeddings and gradients of one of the two types of inputs within absolute error $\varepsilon$.
Proof of Proposition 8. We consider the one-layer GraphSAGE-GCN with ReLU (i.e., $\sigma(x) = \mathrm{ReLU}(x)$). We use the following two types of inputs:

• $G$ is the clique $K_n$, $W^{(1)} = (-1, 1)$, $x_i = (1, 1)^\top$ ($i \ne v$) for some node $v \in V$, and $x_v = (1, 2)^\top$.
• $G$ is the clique $K_n$, $W^{(1)} = (-1, 1)$, $x_i = (1, 1)^\top$ ($i \ne v$) for some node $v \in V$, and $x_v = (1, 0)^\top$.

Then, for the former type of inputs, $\mathrm{MEAN}(\{x_u \mid u \in \mathcal{N}(i)\}) = (1, 1 + 1/n)^\top$, $h_i = z_i = 1/n$, and $\partial z_i/\partial W = (1, 1 + 1/n)$ for all $i \in V$. For the latter type of inputs, $\mathrm{MEAN}(\{x_u \mid u \in \mathcal{N}(i)\}) = (1, 1 - 1/n)^\top$, $h_i = -1/n$, $z_i = 0$, and $\partial z_i/\partial W = (0, 0)$ for all $i \in V$. Let $A$ be an arbitrary constant-time algorithm and $C$ be the number of queries $A$ makes when we set $\varepsilon = 1/3$. When $A$ calculates the gradient of $u \ne v \in V$, the states of all nodes but $u$ are symmetric until $A$ makes a query about that node. Therefore, if $n$ is sufficiently large, $A$ does not make any query about $v$ with high probability (i.e., with probability at least $(1 - n^{-1})^C$). If $A$ does not make any query about $v$, the state of $A$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality, and vice versa. Therefore, $A$ fails to approximate the gradients of one of the two types of inputs within absolute error $\varepsilon$.
Proof of Proposition 9. We consider the one-layer GraphSAGE-pool whose activation function satisfies $\sigma(1) \ne \sigma(0)$ and the following two types of inputs:

• $G$ is the clique $K_n$, $W^{(1)} = 1$, $b = 0$, and $x_i = 0$ for all nodes $i \in V$.
• $G$ is the clique $K_n$, $W^{(1)} = 1$, $b = 0$, $x_i = 0$ ($i \ne v$) for some node $v \in V$, and $x_v = 1$.

Then, for the former type of inputs, $z_i = \sigma(0)$ for all $i \in V$. For the latter type of inputs, $z_i = \sigma(1)$ for all $i \in V$. Let $A$ be an arbitrary constant-time algorithm and $C$ be the number of queries $A$ makes when we set $\varepsilon = |\sigma(1) - \sigma(0)|/3$. When $A$ calculates the embedding of $u \ne v \in V$, the states of all nodes but $u$ are symmetric until $A$ makes a query about that node. Therefore, if $n$ is sufficiently large, $A$ does not make any query about $v$ with high probability (i.e., with probability at least $(1 - n^{-1})^C$). If $A$ does not make any query about $v$, the state of $A$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality, and vice versa. Therefore, $A$ fails to approximate the embeddings of one of the two types of inputs within absolute error $\varepsilon$.
Proof of Proposition 10. We consider the one-layer GCNs whose activation function satisfies $\sigma(1) \ne \sigma(0)$. We use the following two types of inputs:

• $G$ is a star graph, where $v \in V$ is the center of $G$, $W^{(1)} = 1$, and all features are $0$.
• $G$ is a star graph, where $v \in V$ is the center of $G$, $W^{(1)} = 1$, the features of $\sqrt{n}$ leaves are $1$, and the features of the other nodes are $0$.

Then, for the former type of inputs, $z_v = \sigma(0)$. For the latter type of inputs, $z_v = \sigma(1)$ in the limit of large $n$. Let $A$ be an arbitrary constant-time algorithm and $C$ be the number of queries $A$ makes when we set $\varepsilon = |\sigma(1) - \sigma(0)|/3$. When $A$ calculates the embedding of a node $u \in V$ with $x_u = 0$, the states of all nodes but $u$ are symmetric until $A$ makes a query about them. Therefore, if $n$ is sufficiently large, $A$ does not make any query about the leaves with feature $1$ with high probability (i.e., with probability at least $(1 - \sqrt{n}\, n^{-1})^C$). If $A$ does not make any query about these leaves, the state of $A$ is the same for both types of inputs. If the approximation error is less than $\varepsilon$ for the first type of inputs, the approximation error is larger than $\varepsilon$ for the second type of inputs by the triangle inequality, and vice versa. Therefore, $A$ fails to approximate the embeddings of one of the two types of inputs within absolute error $\varepsilon$.

We prove the following lemma to prove Proposition 11.
Lemma 17. If Assumption 1 holds true and $\sigma$ is Lipschitz continuous, then $\|z_v^{(l)}\|$ and $\|\hat{z}_v^{(l)}\|$ ($l = 1, \ldots, L$) of the GAT model are bounded by a constant.

Proof of Lemma 17. We prove this by induction on the number of layers. The norm of the input of the first layer is bounded by Assumption 1. If $\|z_u^{(l-1)}\|$ and $\|\hat{z}_u^{(l-1)}\|$ are bounded for all $u \in V$, then $\|h_v^{(l)}\|$ and $\|\hat{h}_v^{(l)}\|$ are bounded, because $h_v^{(l)}$ and $\hat{h}_v^{(l)}$ are weighted means of $z_u^{(l-1)}$ and $\hat{z}_u^{(l-1)}$, respectively, with attention weights that sum to one. Therefore, $\|z_u^{(l)}\|$ and $\|\hat{z}_u^{(l)}\|$ are bounded because $U_l$ is continuous.

Proof of Proposition 11.
We prove the proposition by induction on the number of layers $L$.

Base case: We show that the statement holds true for $L = 1$. Because $U_L$ is Lipschitz continuous, $\forall z_v, h_v, h'_v, \theta$,

$$\|h_v - h'_v\| < O(\varepsilon) \Rightarrow \|U_L(z_v, h_v, \theta) - U_L(z_v, h'_v, \theta)\| < \varepsilon. \quad (30)$$

Let $e_u = \exp(\mathrm{LeakyReLU}(a^{(1)\top} [W^{(1)} z_v^{(0)}, W^{(1)} z_u^{(0)}]))$. Then,

$$\hat{h}_v^{(L)} - h_v^{(L)} = \sum_{u \in S^{(L)}} \hat{\alpha}_{vu} z_u^{(0)} - \sum_{u \in \mathcal{N}(v)} \alpha_{vu} z_u^{(0)} = \frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{e_u}{\frac{1}{r^{(L)}} \sum_{u' \in S^{(L)}} e_{u'}}\, z_u^{(0)} - \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(0)}.$$

Let $x_k$ be the $k$-th sample in $S^{(L)}$ and $X_k = e_{x_k}$. Then,

$$\mathbb{E}[X_k] = \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} e_u. \quad (31)$$

There exist constants $c > 0$ and $C > 0$ such that, for any input satisfying Assumption 1,

$$c < |X_k| < C, \quad (32)$$

because $\|z_v^{(0)}\|$, $\|z_{x_k}^{(0)}\|$, $\|W^{(1)}\|_F$, and $\|a^{(1)}\|$ are bounded by Assumption 1. Therefore, if we take $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ samples,

$$\Pr\!\left[\left| \frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} e_u - \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} e_u \right| \ge O(\varepsilon)\right] \le O(\delta)$$

by Hoeffding's inequality and equations (31) and (32). Because $f(x) = 1/x$ is Lipschitz continuous on $x > c > 0$,

$$\Pr\!\left[\left| \frac{1}{\frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} e_u} - \frac{1}{\frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} e_u} \right| \ge O(\varepsilon)\right] \le O(\delta). \quad (33)$$

Let $Y_k = \frac{e_{x_k}}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_{x_k}^{(0)}$. Then,

$$\mathbb{E}[Y_k] = \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(0)}. \quad (34)$$

There exists a constant $C' \in \mathbb{R}$ such that, for any input satisfying Assumption 1,

$$\|Y_k\| < C' \quad (35)$$

holds true, because $\|z_u^{(0)}\|$ is bounded and $c < |e_u| < C$. Therefore, if we take $r^{(L)} = O(\varepsilon^{-2} \log \delta^{-1})$ samples,

$$\Pr\!\left[\left\| \frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(0)} - \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(0)} \right\| \ge O(\varepsilon)\right] \le O(\delta) \quad (36)$$

holds true by Hoeffding's inequality and equations (34) and (35). Therefore,

$$\Pr\!\left[\left\| \frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{e_u}{\frac{1}{r^{(L)}} \sum_{u' \in S^{(L)}} e_{u'}}\, z_u^{(0)} - \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(0)} \right\| \ge O(\varepsilon)\right] \le O(\delta) \quad (37)$$

holds true by the triangle inequality and equations (33) and (36), and $\Pr[\|\hat{z}_v^{(L)} - z_v^{(L)}\| \ge \varepsilon] \le \delta$ holds true by equations (30) and (37).
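Below is a sketch of the sampled attention aggregation analyzed in this base case: the normalizer and the weighted sum are estimated from the same uniform sample, and the lower bound $c$ on $e_u$ keeps the division stable. `W`, `a`, and the LeakyReLU slope are hypothetical stand-ins for the GAT parameters.

```python
import numpy as np

def estimate_gat_h(v, adj, x, W, a, r, rng, slope=0.2):
    """Sampled GAT aggregation (a sketch): estimates sum_u alpha_vu x_u."""
    def e(u):  # unnormalized attention weight e_u
        s = np.concatenate([W @ x[v], W @ x[u]])
        val = float(a @ s)
        return np.exp(val if val > 0 else slope * val)  # LeakyReLU

    sampled = rng.choice(adj[v], size=r, replace=True)
    weights = np.array([e(u) for u in sampled])
    num = sum(w * x[u] for w, u in zip(weights, sampled)) / r
    den = weights.mean()  # estimates (1/deg(v)) * sum_u e_u, bounded below by c
    return num / den      # ratio of the two empirical means, as in eq. (37)
```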
Inductive step: We show that the statement holds true for $L = l + 1$ if it holds true for $L = l$. Because $U_L$ is Lipschitz continuous, $\forall z_v, h_v, h'_v, \theta$,

$$\|h_v - h'_v\| < O(\varepsilon) \Rightarrow \|U_L(z_v, h_v, \theta) - U_L(z_v, h'_v, \theta)\| < \varepsilon \quad (38)$$

holds true. Moreover,

$$\Pr\!\left[\left\| \frac{1}{r^{(L)}} \sum_{u \in S^{(L)}} \frac{e_u}{\frac{1}{r^{(L)}} \sum_{u' \in S^{(L)}} e_{u'}}\, z_u^{(L-1)} - \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} \frac{e_u}{\frac{1}{\deg(v)} \sum_{u' \in \mathcal{N}(v)} e_{u'}}\, z_u^{(L-1)} \right\| \ge O(\varepsilon)\right] \le O(\delta) \quad (39)$$

holds true by the same argument as in the base case. If we take $r^{(1)}, \ldots, r^{(L-1)} = O(\varepsilon^{-2}(\log \varepsilon^{-1} + \log \delta^{-1}))$ samples,

$$\Pr[\|\hat{z}_u^{(L-1)} - z_u^{(L-1)}\| \ge O(\varepsilon)] \le O(\varepsilon^2 \delta) \quad (40)$$

holds true by the induction hypothesis. Therefore, $\Pr[\|\hat{z}_v^{(L)} - z_v^{(L)}\| \ge \varepsilon] \le \delta$ holds true by equations (38), (39), and (40).

D. Computational Model Assumptions
In this study, we modeled our algorithm as an oracle machine that can make queries about the input, and we measured its complexity by the query complexity. Modeling the algorithm as an oracle machine and measuring the complexity by the query complexity are reasonable for the following reasons (a minimal sketch of the assumed interface follows the list):

• In a realistic setting, the data is stored in storage or in the cloud, and we may not be able to load all the information of a huge network into the main memory. Sometimes the input graph is constructed on demand (e.g., in Web graph mining, the edge information is retrieved when queried). In such cases, reducing the number of queries is crucial because accessing storage or the cloud is very expensive.
• Our algorithm executes a constant number of elementary operations on $O(\log n)$ bits (e.g., accessing the $O(n)$-th address, sampling one element from $O(n)$ elements). Therefore, if we assume that these operations can be done in constant time, the total computational complexity of our algorithms is constant. This assumption is natural because most real-world computers can handle 64-bit integers at once, and most network data contain fewer than $2^{64} \approx 10^{19}$ nodes.
• Even if the above assumption is not satisfied, our algorithm can be executed in $O(\log n)$ time in the strict sense of computational complexity. This indicates that our algorithm is still sub-linear and therefore scales well. It should be noted that it is impossible to access even a single node in $o(\log n)$ time in the strict sense of computational complexity, because we cannot distinguish $n$ nodes with $o(\log n)$ bits. Therefore, our algorithm has optimal complexity with respect to the number of nodes $n$.
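The query model can be summarized by the following interface sketch; the method names are illustrative, not an API defined in the paper.

```python
from typing import Protocol, Sequence

class GraphOracle(Protocol):
    """Hypothetical oracle interface assumed by the analysis.

    The algorithm never loads the whole graph; it only issues these
    queries, so the cost is measured by the number of calls.
    """
    def degree(self, v: int) -> int: ...                 # deg(v)
    def neighbor(self, v: int, k: int) -> int: ...       # k-th neighbor of v
    def feature(self, v: int) -> Sequence[float]: ...    # O_feature(v)
```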
E. Graph Embedding

Our method can be extended to graph embedding, which embeds an entire graph instead of a single node. The graph embedding can be calculated by aggregating the embeddings of all nodes (Gilmer et al., 2017): $z_G = \mathrm{READOUT}(\{z_i \mid i \in V\})$. We adopt the mean of the feature vectors of the nodes as the readout function (i.e., $z_G = \frac{1}{n} \sum_{i \in V} z_i$). However, we cannot calculate the embeddings of all nodes in constant time even if each calculation is done in constant time, because doing so takes $\Omega(n)$ time in total. We adopt the sampling strategy here as well: we sample some nodes uniformly at random, compute their feature vectors in constant time using Algorithm 2, and calculate their empirical mean. The errors of the node sampling and of Algorithm 2 are bounded by Lemma 12 and Theorem 1, respectively. Therefore, we sample a sufficiently large (but graph-size-independent) number of nodes and call Algorithm 2 with sufficiently small $\varepsilon$ and $\delta$. Then, the estimate is arbitrarily close to the exact embedding of $G$ with an arbitrarily high probability.
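A sketch of this two-level sampling scheme follows; `estimate_z` is assumed to be the constant-time node estimator (Algorithm 2 in the paper), and `num_nodes_sampled` is chosen independently of the graph size.

```python
import numpy as np

def estimate_graph_embedding(nodes, estimate_z, num_nodes_sampled, rng):
    """Mean readout over a uniform node sample (a sketch).

    The node-sampling error (Lemma 12) and the per-node estimation
    error (Theorem 1) are controlled separately, so the total error can
    be made arbitrarily small with graph-size-independent budgets.
    """
    sampled = rng.choice(nodes, size=num_nodes_sampled, replace=True)
    return np.mean([estimate_z(v) for v in sampled], axis=0)
```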
F. Time Complexity

We summarize the time complexities of the approximation and exact algorithms in Table 2. Let $B_L = \{v\}$ and $B_l = \bigcup_{u \in B_{l+1}} \mathcal{N}(u)$ ($l = 0, \ldots, L-1$). In other words, $B_{L-k}$ is the set of $k$-hop neighbors of node $v$. All features of $B_0$ are required to calculate the exact embedding of node $v$. If the graph is sparse, the size of $B_{L-k}$ grows exponentially with respect to $k$, because $\deg(u)$ nodes are added to $B_{L-k}$ for each node $u \in B_{L-k+1}$; namely, it is bounded by $\Delta^k$. If the graph is dense, $B_{L-k} \approx V$ ($1 \le k \le L$). Therefore, the complexity of the exact computation is linear with respect to the number of layers $L$ and the number of edges $m$.

Table 2. Time complexity of embedding algorithms. $\Delta$ denotes the maximum degree. It should be noted that in a dense graph, $O(m) = O(n^2)$ by definition.

             Sparse                                                      Dense
Proposed     O(ε^{-2L} (log ε^{-1} + log δ^{-1})^{L-1} log δ^{-1})       O(ε^{-2L} (log ε^{-1} + log δ^{-1})^{L-1} log δ^{-1})
Exact        O(Δ^L)                                                     O(mL) = O(n^2 L)
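For comparison, the exact computation must touch the whole receptive field $B_0$; the following sketch evaluates the recursion defining $B_l$ so that its growth can be inspected on small graphs.

```python
def receptive_field(v, adj, L):
    """Computes B_0 from B_L = {v}, B_l = union of N(u) over u in B_{l+1}.

    On a graph of maximum degree Delta, |B_0| can reach Delta^L, while
    the sampling estimator touches a number of nodes independent of n.
    """
    B = {v}
    for _ in range(L):
        B = {u for w in B for u in adj[w]}
    return B
```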