Mean-field theory of graph neural networks in graph partitioning
Tatsuro Kawamoto, Masashi Tsubaki
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi, Koto-ku, Tokyo, Japan
{kawamoto.tatsuro, tsubaki.masashi}@aist.go.jp
Tomoyuki Obuchi
Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan
[email protected]
Abstract
A theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage in terms of flexibility that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNN has a high accuracy in addition to this flexibility. Moreover, whether the achieved performance is predominantly a result of the backpropagation or of the architecture itself is a matter of considerable interest. To gain a better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem. This demonstrates a good agreement with numerical experiments.
Deep neural networks have been subject to significant attention concerning many tasks in machine learning, and a plethora of models and algorithms have been proposed in recent years. The application of the neural network approach to problems on graphs is no exception and is being actively studied, with applications including social networks and chemical compounds [1, 2]. A neural network model on graphs is termed a graph neural network (GNN) [3]. While excellent performances of GNNs have been reported in the literature, many of these results rely on experimental studies and seem to be based on the blind belief that the nonlinear nature of GNNs leads to such strong performances. However, when a deep neural network outperforms other methods, the factors that are really essential should be clarified: Is this thanks to the learning of model parameters, e.g., through backpropagation [4], or rather to the architecture of the model itself? Is the choice of the architecture predominantly crucial, or would even a simple choice perform sufficiently well? Moreover, does the GNN generically outperform other methods?

To obtain a better understanding of these questions, not only empirical knowledge based on benchmark tests but also theoretical insights are required. To this end, we develop a mean-field theory of GNN, focusing on the problem of graph partitioning. The theory concerns a GNN with random model parameters, i.e., an untrained GNN. If the architecture of the GNN itself is essential, then the performance of the untrained GNN should already be effective. On the other hand, if the fine-tuning of the model parameters via learning is crucial, then the result for the untrained GNN is again useful to observe the extent to which the performance is improved.
Preprint. Work in progress.

Figure 1: Architecture of the GNN considered in this paper.

Table 1: Comparison of various methods under the framework of Eq. (2).

algorithm        | domain | M     | φ(x)    | ϕ(x)   | W^t     | b^t     | {W^t, b^t} update
untrained GNN    | V      | A     | tanh    | I(x)   | random  | omitted | not trained
trained GNN [5]  | V      | I − L | ReLU    | I(x)   | trained | omitted | trained via backprop.
spectral method  | V      | L     | I(x)    | I(x)   | QR      | absent  | updated at each layer
EM + BP          | E⃗      | B     | softmax | log(x) | learned | learned | learned via M-step

For a given graph G = (V, E), where V is the set of vertices and E is the set of (undirected) edges, the graph partitioning problem involves assigning one out of K group labels to each vertex. Throughout this paper, we restrict ourselves to the case of two groups (K = 2). The problem setting for graph partitioning is relatively simple compared with other GNN applications, and it is thus suitable as a baseline for more complicated problems. There are two types of graph partitioning problem: one is to find the best partition of a given graph under a certain objective function; the other is to assume that a graph is generated by a statistical model and to infer the planted (i.e., preassigned) group labels of the generative model. Herein, we consider the latter problem. Before moving on to the mean-field theory, we first clarify the algorithmic relationship between GNN and other methods of graph partitioning.

The goal of this paper is to examine the graph partitioning performance using a minimal GNN architecture. To this end, we consider a GNN with the following feedforward dynamics. Each vertex is characterized by a D-dimensional feature vector whose elements are x_{iμ} (i ∈ V, μ ∈ {1, ..., D}), and the state matrix X = [x_{iμ}] obeys

\[
x^{t+1}_{i\mu} = \sum_{j\nu} A_{ij}\, \phi\bigl(x^{t}_{j\nu}\bigr)\, W^{t}_{\nu\mu} + b^{t}_{i\mu}. \qquad (1)
\]

Throughout this paper, the layers of the GNN are indexed by t ∈ {1, ..., T}. Furthermore, φ(x) is a nonlinear activation function, b^t = [b^t_{iμ}] is a bias term (the bias term is only included in this section and will be omitted in later sections; alternatively, diag(b^t_{i1}, ..., b^t_{iD}) can be regarded as added to W^t when b^t has common rows), and W^t = [W^t_{νμ}] is a linear transform that operates on the feature space. To infer the group assignments, a D × K matrix W^out of the readout classifier is applied at the end of the last layer. Although there is no restriction on φ in our mean-field theory, we adopt φ = tanh as a specific choice in the experiments below. As there is no detailed attribute on each vertex, the initial state X^0 is set to be uniformly random, and the adjacency matrix A is the only input in the present case. For graphs with vertex attributes, deep neural networks that utilize such information [6–8] have also been proposed.

The bias terms {b^t}, the linear transforms {W^t} in the intermediate layers, and W^out for the readout classifier are initially set to be random. These are updated through backpropagation using the training set. In the semi-supervised setting, (real-world) graphs in which only a portion of the vertices are correctly labeled are employed as the training set.
On the other hand, in the case of unsupervised learning, graph instances of a statistical model can be employed as the training set.

The GNN architecture described above can be thought of as a special case of the following more general form:

\[
x^{t+1}_{i\mu} = \sum_{j} M_{ij}\, \varphi\!\left( \sum_{\nu} \phi\bigl(x^{t}_{j\nu}\bigr) W^{t}_{\nu\mu} \right) + b^{t}_{i\mu}, \qquad (2)
\]

where ϕ(x) is another activation function. With the different choices of matrix and activation functions shown in Table 1, various algorithms can be obtained. Equation (1) is recovered by setting M_{ij} = A_{ij} and ϕ(x) = I(x) (where I(x) is the identity function), while [5] employed a Laplacian-like matrix M = I − L = D^{−1/2} A D^{−1/2}, where D^{−1/2} ≡ diag(1/√d_1, ..., 1/√d_N) (d_i is the degree of vertex i) and L is called the normalized Laplacian [9].

The spectral method using the power iteration is obtained in the limit where φ(x) and ϕ(x) are linear and b^t is absent. For the simultaneous power iteration that extracts the leading K eigenvectors, the state matrix X is set as an N × K matrix whose column vectors are mutually orthogonal. While the normalized Laplacian L is commonly adopted for M, there are several other choices [9–11]. (A matrix const. × I is sometimes added in order to shift the eigenvalues.) At each iteration, the orthogonality condition is maintained via QR decomposition [12]: for Z^t := M X^t, we set X^{t+1} = Z^t R_t^{−1}, where R_t is the D × D upper triangular matrix obtained by the QR decomposition of Z^t, so that R_t^{−1} acts as W^t. Therefore, rather than being trained, W^t is determined at each layer based on the current state.

The belief propagation (BP) algorithm (also called the message passing algorithm) in Bayesian inference also falls under the framework of Eq. (2). While the domain of the state consists of the vertices (i, j ∈ V) for GNNs, this algorithm deals with the directed edges i → j ∈ E⃗, where E⃗ is obtained by assigning directions to every undirected edge. In this case, the state x^t_{σ,i→j} represents the logarithm of the marginal probability that vertex i belongs to group σ when the information from vertex j is withheld, at the t-th iteration. With the choice of matrix and activation functions shown in Table 1 (EM + BP), Eq. (2) becomes exactly the update equation of the BP algorithm [13]. (Precisely speaking, this is the BP algorithm in which the stochastic block model (SBM) is assumed as the generative model; the SBM is explained below.) The matrix M = B = [B_{j→k, i→j}] is the so-called non-backtracking matrix [14], and the softmax function represents the normalization of the state x^t_{σ,i→j}.

The BP algorithm requires the model parameters W^t and b^t as inputs. For example, when the expectation-maximization (EM) algorithm is considered, the BP algorithm comprises one half (the E-step) of the algorithm. The parameter learning of the model is conducted in the other half (the M-step), which can be performed analytically using the current result of the BP algorithm. Here, W^t and b^t are the estimates of the so-called density matrix (or affinity matrix) and of the external field resulting from the messages from non-edges [13], respectively, and they are common for every t. Therefore, the differences between the EM algorithm and the GNN can be summarized as follows. While there is an analogy between the inference procedures, in the EM algorithm the parameter learning of the model is conducted analytically, at the expense of the restrictions imposed by the assumed statistical model. In GNNs, on the other hand, the learning is conducted numerically in a data-driven manner [15], for example by backpropagation. While we shed light on the detailed correspondence in the case of graph partitioning here, the relationship between GNN and BP is also discussed in [16, 17].
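To make the feedforward dynamics of Eq. (1) concrete, the following is a minimal NumPy sketch of an untrained GNN: random Gaussian weights W^t, φ = tanh, no bias, and a k-means step on the final state in place of a trained readout classifier. This is an illustration rather than the authors' implementation; the 1/√D weight scale, the column-wise normalization of the state, and the specific values of D and T are assumptions made here for numerical stability.

```python
import numpy as np

def untrained_gnn_labels(A, D=16, T=20, K=2, seed=0):
    """Run T random-weight layers of Eq. (1) on adjacency matrix A and cluster the final state."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    X = rng.uniform(-1.0, 1.0, size=(N, D))                  # random initial state X^0
    for _ in range(T):
        W = rng.normal(0.0, 1.0 / np.sqrt(D), size=(D, D))   # random W^t (variance 1/D: assumption)
        X = A @ np.tanh(X) @ W                                # Eq. (1) without the bias term
        X /= np.linalg.norm(X, axis=0, keepdims=True) + 1e-12 # rescale columns to avoid blow-up (illustrative)
    return kmeans(X, K, rng)

def kmeans(X, K, rng, n_iter=100):
    """Plain k-means on the rows of X, used here as the readout classifier."""
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

Given an adjacency matrix A (for example, one sampled from the SBM introduced below), `untrained_gnn_labels(A)` returns a two-group assignment that can be compared with the planted labels up to a permutation.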
We analyze the performance of an untrained GNN on the stochastic block model (SBM). This is a random graph model with a planted group structure, and it is commonly employed as the generative model. The graph has |V| = N vertices, and each vertex has a preassigned group label σ ∈ {1, ..., K}, i.e., V = ∪_σ V_σ. We define V_σ as the set of vertices in group σ, γ_σ ≡ |V_σ|/N, and σ_i represents the planted group assignment of vertex i. For each pair of vertices i ∈ V_σ and j ∈ V_{σ'}, an edge is generated with probability ρ_{σσ'}, which is an element of the density matrix. Throughout this paper, we assume that ρ_{σσ'} = O(N^{−1}), so that the resulting graph has a constant average degree, or in other words, the graph is sparse. We denote the average degree by c. Therefore, the adjacency matrix A = [A_{ij}] of the SBM is generated with probability

\[
p(A) = \prod_{i<j} \rho_{\sigma_i \sigma_j}^{A_{ij}} \bigl( 1 - \rho_{\sigma_i \sigma_j} \bigr)^{1 - A_{ij}}.
\]
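As an illustration of the generative process just described, the sketch below samples a sparse two-group SBM. Parameterizing the density matrix by the average degree c and the ratio eps = ρ_out/ρ_in, and taking equally sized groups, are choices made here for concreteness; the text above only requires ρ_{σσ'} = O(N^{−1}).

```python
import numpy as np

def sample_sbm(N=1000, c=8.0, eps=0.3, seed=0):
    """Sample a two-group SBM with average degree c and ratio eps = rho_out / rho_in."""
    rng = np.random.default_rng(seed)
    sizes = (N // 2, N - N // 2)
    labels = np.repeat([0, 1], sizes)                 # planted assignments sigma_i
    # Choose rho_in, rho_out so that the average degree equals c (equal group sizes assumed).
    rho_in = 2.0 * c / (N * (1.0 + eps))
    rho_out = eps * rho_in
    rho = np.array([[rho_in, rho_out], [rho_out, rho_in]])
    A = np.zeros((N, N), dtype=int)
    iu, ju = np.triu_indices(N, k=1)                  # each pair i<j gets an edge w.p. rho_{sigma_i sigma_j}
    edges = rng.random(len(iu)) < rho[labels[iu], labels[ju]]
    A[iu[edges], ju[edges]] = 1
    A += A.T                                          # symmetric, undirected adjacency matrix
    return A, labels
```

Combined with the GNN sketch above, `sample_sbm` followed by `untrained_gnn_labels` reproduces the type of experiment analyzed in this paper, with the overlap computed against the planted labels up to a label permutation.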
References

[1] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.
[2] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
[3] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61–80, January 2009.
[4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
[5] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[6] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pages 1024–1034. Curran Associates, Inc., 2017.
[8] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2024–2033. PMLR, 2017.
[9] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, August 2007.
[10] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
[11] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
[12] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[13] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84:066106, 2011.
[14] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. U.S.A., 110(52):20935–40, December 2013.
[15] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. GraphRNN: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.
[16] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[17] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
[18] Tiago P. Peixoto. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E, 89:012804, January 2014.
[19] Tiago P. Peixoto. Bayesian stochastic blockmodeling. arXiv preprint arXiv:1705.10225, 2017.
[20] M. E. J. Newman and Gesine Reinert. Estimating the number of communities in a network. Phys. Rev. Lett., 117:078301, August 2016.
[21] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3):431–461, August 2015.
[22] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 694–703, New York, 2014. ACM.
[23] Cristopher Moore. The computer science and physics of community detection: landscapes, phase transitions, and hardness. arXiv preprint arXiv:1702.00467, 2017.
[24] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc., 2016.
[25] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.
[26] H. Sompolinsky and Annette Zippelius. Relaxational dynamics of the Edwards-Anderson model and the mean-field theory of spin-glasses. Phys. Rev. B, 25:6860–6875, June 1982.
[27] A. Crisanti and H. Sompolinsky. Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Phys. Rev. A, 37:4865–4874, June 1988.
[28] A. Crisanti, H. J. Sommers, and H. Sompolinsky. Chaos in neural networks: chaotic solutions. Preprint, 1990.
[29] Manfred Opper, Burak Çakmak, and Ole Winther. A theory of solving TAP equations for Ising models with general invariant random matrices. Journal of Physics A: Mathematical and Theoretical, 49(11):114002, 2016.
[30] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[31] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
[32] Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[33] Tatsuro Kawamoto and Yoshiyuki Kabashima. Limitations in the spectral method for graph partitioning: Detectability threshold and localization of eigenvectors. Phys. Rev. E, 91:062803, June 2015.
[34] T. Kawamoto and Y. Kabashima. Detectability of the spectral method for sparse graph partitioning. Eur. Phys. Lett., 112(4):40007, 2015.
[35] Tatsuro Kawamoto. Algorithmic detectability threshold of the stochastic block model. Phys. Rev. E, 97:032301, March 2018.
[36] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems, 2015.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[40] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[41] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, pages 3844–3852. Curran Associates, Inc., 2016.
[42] Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang. Modularity based community detection with deep learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2252–2258. AAAI Press, 2016.
A Derivation of the self-consistent equation
In this section, the detailed derivation of the self-consistent equation for the covariance matrix is presented. Here, we recast our starting-point equation:

\[
1 = \int D\hat{x}^{t+1} D x^{t+1}\, e^{-L_1} \left\langle e^{L_2} \right\rangle_{A, W^t, X^t},
\qquad
\begin{cases}
L_1 = \sum_{\sigma\mu} \gamma_\sigma\, \hat{x}^{t+1}_{\sigma\mu} x^{t+1}_{\sigma\mu}, \\[2pt]
L_2 = \dfrac{1}{N} \sum_{\mu\nu} \sum_{ij} A_{ij}\, W^{t}_{\nu\mu}\, \hat{x}^{t+1}_{\sigma_i \mu}\, \phi(x^{t}_{j\nu}).
\end{cases}
\qquad (12)
\]

A.1 Random averages over W^t and A

We first take the average of exp(L_2) over W^t. The Gaussian integral with respect to W^t yields

\[
\left\langle e^{L_2} \right\rangle_{W^t}
= \exp\!\left[ \frac{1}{2DN^2} \sum_{\mu\nu} \sum_{ijk\ell} A_{ij}\, \hat{x}^{t+1}_{\sigma_i\mu} \phi(x^{t}_{j\nu})\, A_{k\ell}\, \hat{x}^{t+1}_{\sigma_k\mu} \phi(x^{t}_{\ell\nu}) \right]
= \exp\!\left( \frac{1}{2D} \sum_{\sigma_1\sigma_2\sigma_3\sigma_4} \sum_{\mu} \hat{x}^{t+1}_{\sigma_1\mu} \hat{x}^{t+1}_{\sigma_3\mu} \sum_{\nu} \psi^{t,\nu}_{\sigma_1\sigma_2} \psi^{t,\nu}_{\sigma_3\sigma_4} \right)
= \exp\!\left( \frac{D}{2} \sum_{\sigma_1\sigma_2} u^{t+1}_{\sigma_1\sigma_2} v^{t}_{\sigma_1\sigma_2} \right), \qquad (13)
\]

where we introduce the following quantities:

\[
\psi^{t,\nu}_{\sigma\sigma'} \equiv \frac{1}{N} \sum_{i \in V_\sigma} \sum_{j \in V_{\sigma'}} A_{ij}\, \phi(x^{t}_{j\nu}), \qquad (14)
\]
\[
u^{t+1}_{\sigma\sigma'} \equiv \frac{1}{D} \sum_{\mu} \hat{x}^{t+1}_{\sigma\mu} \hat{x}^{t+1}_{\sigma'\mu}, \qquad (15)
\]
\[
v^{t}_{\sigma\sigma'} \equiv \frac{1}{D} \sum_{\mu} \sum_{\tilde{\sigma}\tilde{\sigma}'} \psi^{t,\mu}_{\sigma\tilde{\sigma}}\, \psi^{t,\mu}_{\sigma'\tilde{\sigma}'}. \qquad (16)
\]

Therefore, Eq. (12) can be written as

\[
1 = \int D\hat{x}^{t+1} D x^{t+1} \Biggl\langle \int D\hat{u}^{t+1} D u^{t+1} \int D\hat{v}^{t} D v^{t} \int D\hat{\psi}^{t} D\psi^{t}
\Biggl\langle \exp\Biggl[ -L_1 + \frac{D}{2} \sum_{\sigma\sigma'} u^{t+1}_{\sigma\sigma'} v^{t}_{\sigma\sigma'}
- \sum_{\sigma\sigma'} \hat{u}^{t+1}_{\sigma\sigma'} \Bigl( D u^{t+1}_{\sigma\sigma'} - \sum_{\mu} \hat{x}^{t+1}_{\sigma\mu} \hat{x}^{t+1}_{\sigma'\mu} \Bigr)
- \sum_{\sigma\sigma'} \hat{v}^{t}_{\sigma\sigma'} \Bigl( D v^{t}_{\sigma\sigma'} - \sum_{\nu} \sum_{\tilde{\sigma}\tilde{\sigma}'} \psi^{t,\nu}_{\sigma\tilde{\sigma}} \psi^{t,\nu}_{\sigma'\tilde{\sigma}'} \Bigr)
- \sum_{\sigma\sigma'} \sum_{\mu} \hat{\psi}^{t,\mu}_{\sigma\sigma'} \Bigl( \psi^{t,\mu}_{\sigma\sigma'} - \frac{1}{N} \sum_{i \in V_\sigma} \sum_{j \in V_{\sigma'}} A_{ij}\, \phi(x^{t}_{j\mu}) \Bigr) \Biggr] \Biggr\rangle_A \Biggr\rangle_{X^t}. \qquad (17)
\]

Note that, as we will see below, u^{t+1}, v^t, ψ^t, and their conjugates are related to X^t, and thus the average over X^t is taken outside of their integrals.

We next take the average over a random graph. In Eq. (17), only the final term in the exponent is relevant to A. We denote this term as L_3. We also let

\[
\Xi^{t}_{ij} = \frac{1}{N} \sum_{\mu} \Bigl( \hat{\psi}^{t,\mu}_{\sigma_i\sigma_j}\, \phi(x^{t}_{j\mu}) + \hat{\psi}^{t,\mu}_{\sigma_j\sigma_i}\, \phi(x^{t}_{i\mu}) \Bigr).
\]

Because the graph is generated from the SBM, we have that

\[
\left\langle e^{L_3} \right\rangle_A = \prod_{i<j} \cdots
\]

Here, we compare Eq. (24) with a Markovian discrete-time stochastic process y^{t+1}_σ = η^t_σ, in which each element is correlated via a random noise, i.e., ⟨η^t_σ⟩_η = 0 and ⟨η^t_σ η^t_{σ'}⟩_η = C_{σσ'} for any t. The corresponding normalization condition reads

\[
1 = \int \prod_{\sigma} dy^{t+1}_{\sigma} \Biggl\langle \prod_{\sigma} \delta\bigl( y^{t+1}_{\sigma} - \eta^{t}_{\sigma} \bigr) \Biggr\rangle_{\eta}
= \int \prod_{\sigma} \frac{\gamma_\sigma\, d\hat{y}^{t+1}_{\sigma}\, dy^{t+1}_{\sigma}}{2\pi i} \Bigl\langle e^{-\sum_\sigma \gamma_\sigma \hat{y}^{t+1}_{\sigma} ( y^{t+1}_{\sigma} - \eta^{t}_{\sigma} )} \Bigr\rangle_{\eta}
\]
\[
= \int D\hat{y}^{t+1} D y^{t+1}\, e^{-\sum_\sigma \gamma_\sigma \hat{y}^{t+1}_{\sigma} y^{t+1}_{\sigma}} \int \prod_{\sigma} \frac{d\eta^{t}_{\sigma}}{\sqrt{\det(2\pi C)}} \exp\Biggl( -\frac{1}{2} \sum_{\sigma\sigma'} \eta^{t}_{\sigma} C^{-1}_{\sigma\sigma'} \eta^{t}_{\sigma'} + \sum_{\sigma} \gamma_\sigma \hat{y}^{t+1}_{\sigma} \eta^{t}_{\sigma} \Biggr)
\]
\[
= \int D\hat{y}^{t+1} D y^{t+1} \exp\Biggl( -\sum_{\sigma} \gamma_\sigma \hat{y}^{t+1}_{\sigma} y^{t+1}_{\sigma} + \frac{1}{2} \sum_{\sigma\sigma'} \hat{y}^{t+1}_{\sigma} \gamma_\sigma C_{\sigma\sigma'} \gamma_{\sigma'} \hat{y}^{t+1}_{\sigma'} \Biggr). \qquad (26)
\]

Analogously to the case of the GNN, we have defined D\hat{y}^{t+1} D y^{t+1} ≡ ∏_σ γ_σ dŷ^{t+1}_σ dy^{t+1}_σ / (2πi).

A.3 Self-consistent equation

Finally, we compare Eqs. (24) and (26). However, note that these are not of exactly the same form, because the average over X^t is taken outside of the exponential in Eq. (24).
Two approximations are made in order to derive the self-consistent equation, and the assumptions that justify these approximations are discussed afterward.

First, if the approximation

\[
\Biggl\langle \exp\Biggl( \sum_{\sigma\sigma'} \hat{x}^{t+1}_{\sigma} F_{\sigma\sigma'}\bigl( X^{t} \bigr) \hat{x}^{t+1}_{\sigma'} \Biggr) \Biggr\rangle_{X^t}
\approx \exp\Biggl( \sum_{\sigma\sigma'} \hat{x}^{t+1}_{\sigma} \bigl\langle F_{\sigma\sigma'}\bigl( X^{t} \bigr) \bigr\rangle_{X^t} \hat{x}^{t+1}_{\sigma'} \Biggr) \qquad (27)
\]

holds in the stationary limit, then the group-wise state x^t can be regarded as a Gaussian variable whose correlation matrix obeys

\[
C_{\sigma\sigma'} = \frac{1}{\gamma_\sigma \gamma_{\sigma'}} \sum_{\tilde{\sigma}\tilde{\sigma}'} B_{\sigma\tilde{\sigma}} B_{\sigma'\tilde{\sigma}'}
\Biggl\langle \frac{1}{\gamma_{\tilde{\sigma}} N}\, \frac{1}{\gamma_{\tilde{\sigma}'} N} \sum_{i \in V_{\tilde{\sigma}}} \sum_{j \in V_{\tilde{\sigma}'}} \phi(x_i)\, \phi(x_j) \Biggr\rangle_{X^t}. \qquad (28)
\]

This equation is still not closed, because the right-hand side of Eq. (28) depends on the statistics of X^t rather than x^t. However, because the vertices within a group σ are statistically equivalent, {x_i}_{i∈V_σ} are expected to obey the same distribution with mean x_σ, which itself is a random variable. If Σ_{i∈V_σ} φ(x_i)/(γ_σ N) ≈ φ(x_σ) holds, then the right-hand side of Eq. (28) can be evaluated as the average with respect to the group-wise variable x^t. Then, within this regime we arrive at the following self-consistent equation with respect to the covariance matrix C = [C_{σσ'}]:

\[
C_{\sigma\sigma'} = \frac{1}{\gamma_\sigma \gamma_{\sigma'}} \sum_{\tilde{\sigma}\tilde{\sigma}'} B_{\sigma\tilde{\sigma}} B_{\sigma'\tilde{\sigma}'}
\int d\mathbf{x}\, \frac{ e^{-\frac{1}{2} \mathbf{x}^{\top} C^{-1} \mathbf{x}} }{ \sqrt{ (2\pi)^{K} \det C } }\, \phi(x_{\tilde{\sigma}})\, \phi(x_{\tilde{\sigma}'}). \qquad (29)
\]

(An illustrative numerical iteration of this equation is sketched at the end of this supplementary material.)

Let us consider the first approximation, which we adopted in Eq. (27). In the terminology of physics, this is the replacement of a free energy with an internal energy, or the neglect of the entropic contribution. It is difficult to evaluate this residual in general. However, note that the approximation becomes closer to an equality as every x_i approaches the same value. This implies that the self-consistent equation is more accurate as we approach the detectability limit, and yields an adequate estimate of the critical value.

Let us next consider the second approximation, adopted in Eq. (28). Although the law of large numbers with respect to φ(x_i) (not x_i) ensures that Σ_{i∈V_σ} φ(x_i)/(γ_σ N) converges to a value characterized by the group, this value may differ from φ(x_σ). In fact, the relation between the two is in general an inequality (Jensen's inequality) when the activation function φ is convex. The (exact) equality holds only when {x_i} is constant or the function φ is linear within the considered domain.

The second approximation can be justified in the following cases. The first case is when the fluctuation of x_i − x_{σ_i} is negligible compared with the magnitude of x_σ. Note that this is the same assumption as we made in the first approximation. To see this precisely, let us express x_i as x_i = x_σ + z_i for i ∈ V_σ. We can formally write the probability distribution P({x_i}) of {x_i} in a hierarchical fashion as follows:

\[
P(\{x_i\}) = \int \prod_{\sigma} d x_{\sigma} \int \prod_{i} d z_i\, P_{\sigma}(\{x_{\sigma}\})\, P(\{x_i\}) \prod_{i} \delta\bigl( z_i - x_i + x_{\sigma_i} \bigr), \qquad (30)
\]

where P_σ({x_σ}) is the probability distribution with respect to x_σ. Thus, the expectation ⟨f(X)⟩_X can be expressed as

\[
\langle f(X) \rangle_X \equiv \int \prod_i d x_i\, P(\{x_i\})\, f(\{x_i\})
= \int \prod_{\sigma} d x_{\sigma} \int \prod_i d z_i\, P(\{z_i\} \mid \{x_{\sigma}\})\, P_{\sigma}(\{x_{\sigma}\})\, f(\{x_{\sigma_i} + z_i\}), \qquad (31)
\]

where P({z_i}|{x_σ}) ≡ P({x_{σ_i} + z_i}), which can be a nontrivial function. However, whenever the contributions from the average with respect to z_i are negligible, Eq. (31) implies that the expectation in Eq. (28) can be evaluated using only the group-wise variables {x_σ}. Another case is when the activation function φ is almost linear within the domain over which z_i fluctuates. For example, in the case that φ = tanh, the present approximation does not deteriorate the accuracy even when x_σ ≈ 0. When either of these assumptions holds, the equality in Jensen's inequality is approximately satisfied, and our derivation of the self-consistent equation is justified.

B K-means classification using φ(X)

Instead of X^T, φ(X^T) can be adopted to perform the k-means classification after the feedforward process. Again, we employ tanh as the nonlinear activation function. The results of an untrained GNN and a trained GNN under the same experimental settings as in the main text are illustrated in Fig. 5 and Fig. 6, respectively. In Fig. 5a, the reader should note that the range of the color gradient is different from that in the phase diagram in the main text. For the untrained GNN, the obtained overlaps are clearly better than those using X^T. This can be understood as follows: the error is reduced because the nonlinear function drives each element of the state X^T to either +1 or −1, making the classification using the k-means method easier and more accurate. On the other hand, for the trained GNN, differences between the overlaps using X^T and φ(X^T) are hardly observable.

Particularly for the case of an untrained GNN in which φ(X^T) is adopted for the readout classifier, the overlap gradually changes around the estimated detectability limit. This may be a result of the strong finite-size effect. Again, note that our estimate of the detectability limit is for the case that N → ∞.

Figure 5: Performance of the untrained GNN using the k-means classifier with φ(X). (a) The detectability phase diagram and (b) the overlaps of the SBM with c = 8 are plotted in the same manner as in Fig. 3 in the main text. When the variation of the overlap is interpolated for each graph size, these curves cross at a common value of ε*, implying the presence of a detectability phase transition around the value of ε predicted by our mean-field estimate.

Figure 6: Performance of the trained GNN using the classifier with φ(X).
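For completeness, the self-consistent equation (29) of Appendix A can be iterated numerically by a damped fixed-point scheme. The sketch below is illustrative only: the matrix B and the group fractions γ_σ entering Eq. (29) are defined in the main text and are therefore treated here as inputs, the Gaussian expectation is estimated by Monte Carlo sampling, and the damping factor and sample size are arbitrary choices.

```python
import numpy as np

def iterate_covariance(B, gamma, phi=np.tanh, n_iter=200, n_samples=200_000, seed=0):
    """Damped fixed-point iteration of Eq. (29): C = diag(1/gamma) B E B^T diag(1/gamma),
    where E[s, s'] = <phi(x_s) phi(x_s')> under x ~ N(0, C), estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    K = len(gamma)
    C = np.eye(K)                                  # arbitrary positive-definite initialization
    for _ in range(n_iter):
        x = rng.multivariate_normal(np.zeros(K), C, size=n_samples)
        E = phi(x).T @ phi(x) / n_samples          # Monte Carlo estimate of <phi(x_s) phi(x_s')>
        C_new = (B @ E @ B.T) / np.outer(gamma, gamma)
        C = 0.5 * C + 0.5 * C_new                  # damping for numerical stability
    return C
```

The behavior of the resulting fixed point is what underlies the mean-field estimate of the detectability limit discussed in Appendix A; as noted there, the self-consistent equation is expected to be most accurate close to that limit.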