Mean-field theory of graph neural networks in graph partitioning
Tatsuro Kawamoto, Masashi Tsubaki
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi, Koto-ku, Tokyo, Japan
{kawamoto.tatsuro, tsubaki.masashi}@aist.go.jp
Tomoyuki Obuchi
Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan
[email protected]
Abstract
A theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage in terms of flexibility that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNN has a high accuracy in addition to this flexibility. Moreover, whether the achieved performance is predominantly a result of the backpropagation or of the architecture itself is a matter of considerable interest. To gain a better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem. This demonstrates a good agreement with numerical experiments.
Deep neural networks have been subject to significant attention concerning many tasks in machine learning, and a plethora of models and algorithms have been proposed in recent years. The application of the neural network approach to problems on graphs is no exception and is being actively studied, with applications including social networks and chemical compounds [1, 2]. A neural network model on graphs is termed a graph neural network (GNN) [3]. While excellent performances of GNNs have been reported in the literature, many of these results rely on experimental studies and seem to be based on the blind belief that the nonlinear nature of GNNs leads to such strong performances. However, when a deep neural network outperforms other methods, the factors that are really essential should be clarified: Is this thanks to the learning of model parameters, e.g., through backpropagation [4], or rather to the architecture of the model itself? Is the choice of the architecture predominantly crucial, or would even a simple choice perform sufficiently well? Moreover, does the GNN generically outperform other methods?

To obtain a better understanding of these questions, not only empirical knowledge based on benchmark tests but also theoretical insights are required. To this end, we develop a mean-field theory of GNN, focusing on the problem of graph partitioning. The theory concerns a GNN with random model parameters, i.e., an untrained GNN. If the architecture of the GNN itself is essential, then the performance of the untrained GNN should already be effective. On the other hand, if the fine-tuning of the model parameters via learning is crucial, then the result for the untrained GNN is again useful to observe the extent to which the performance is improved.
Preprint. Work in progress.

Figure 1: Architecture of the GNN considered in this paper.

Table 1: Comparison of various methods under the framework of Eq. (2).

algorithm        | domain | M     | φ(x)    | ϕ(x)   | W^t     | b^t     | {W^t, b^t} update
untrained GNN    | V      | A     | tanh    | I(x)   | random  | omitted | not trained
trained GNN [5]  | V      | I − L | ReLU    | I(x)   | trained | omitted | trained via backprop.
spectral method  | V      | L     | I(x)    | I(x)   | QR      | absent  | updated at each layer
EM + BP          | E⃗      | B     | softmax | log(x) | learned | learned | learned via M-step

For a given graph G = (V, E), where V is the set of vertices and E is the set of (undirected) edges, the graph partitioning problem involves assigning one out of K group labels to each vertex. Throughout this paper, we restrict ourselves to the case of two groups (K = 2). The problem setting for graph partitioning is relatively simple compared with other GNN applications, and it is thus suitable as a baseline for more complicated problems. There are two types of graph partitioning problem: one is to find the best partition of a given graph under a certain objective function; the other is to assume that a graph is generated by a statistical model and to infer the planted (i.e., preassigned) group labels of the generative model. Herein, we consider the latter problem. Before moving on to the mean-field theory, we first clarify the algorithmic relationship between GNN and other methods of graph partitioning.

The goal of this paper is to examine the graph partitioning performance using a minimal GNN architecture. To this end, we consider a GNN with the following feedforward dynamics. Each vertex is characterized by a D-dimensional feature vector whose elements are x_{iμ} (i ∈ V, μ ∈ {1, ..., D}), and the state matrix X = [x_{iμ}] obeys

\[
x^{t+1}_{i\mu} = \sum_{j\nu} A_{ij}\, \phi\bigl(x^{t}_{j\nu}\bigr)\, W^{t}_{\nu\mu} + b^{t}_{i\mu}. \qquad (1)
\]

Throughout this paper, the layers of the GNN are indexed by t ∈ {1, ..., T}. Furthermore, φ(x) is a nonlinear activation function, b^t = [b^t_{iμ}] is a bias term (the bias term is only included in this section and will be omitted in later sections; alternatively, diag(b^t_{i1}, ..., b^t_{iD}) can be regarded as added to W^t when b^t has common rows), and W^t = [W^t_{νμ}] is a linear transform that operates on the feature space. To infer the group assignments, a D × K matrix W^out of the readout classifier is applied at the end of the last layer. Although there is no restriction on φ in our mean-field theory, we adopt φ = tanh as a specific choice in the experiments below. As there is no detailed attribute on each vertex, the initial state X^0 is set to be uniformly random, and the adjacency matrix A is the only input in the present case. For graphs with vertex attributes, deep neural networks that utilize such information [6–8] have also been proposed.

The bias terms {b^t}, the linear transforms {W^t} in the intermediate layers, and W^out for the readout classifier are initially set to be random. These are updated through backpropagation using the training set. In the semi-supervised setting, (real-world) graphs in which only a portion of the vertices are correctly labeled are employed as the training set.
On the other hand, in the case of unsupervised learning, graph instances of a statistical model can be employed as the training set.

The GNN architecture described above can be thought of as a special case of the following more general form:

\[
x^{t+1}_{i\mu} = \sum_{j} M_{ij}\, \varphi\!\left( \sum_{\nu} \phi\bigl(x^{t}_{j\nu}\bigr) W^{t}_{\nu\mu} \right) + b^{t}_{i\mu}, \qquad (2)
\]

where ϕ(x) is another activation function. With the different choices of matrix and activation functions shown in Table 1, various algorithms can be obtained. Equation (1) is recovered by setting M_{ij} = A_{ij} and ϕ(x) = I(x) (where I(x) is the identity function), while [5] employed a Laplacian-like matrix M = I − L = D^{−1/2} A D^{−1/2}, where D^{−1/2} ≡ diag(1/√d_1, ..., 1/√d_N) (d_i is the degree of vertex i) and L is called the normalized Laplacian [9].

The spectral method using the power iteration is obtained in the limit where φ(x) and ϕ(x) are linear and b^t is absent. For the simultaneous power iteration that extracts the leading K eigenvectors, the state matrix X is set as an N × K matrix whose column vectors are mutually orthogonal. While the normalized Laplacian L is commonly adopted for M, there are several other choices [9–11]. (A matrix const. × I is sometimes added in order to shift the eigenvalues.) At each iteration, the orthogonality condition is maintained via QR decomposition [12]: for Z^t := M X^t, we set X^{t+1} = Z^t R_t^{−1}, where R_t is the D × D upper triangular matrix obtained by the QR decomposition of Z^t, so that R_t^{−1} acts as W^t. Therefore, rather than being trained, W^t is determined at each layer based on the current state.

The belief propagation (BP) algorithm (also called the message passing algorithm) in Bayesian inference also falls under the framework of Eq. (2). While the domain of the state consists of the vertices (i, j ∈ V) for GNNs, this algorithm deals with the directed edges i → j ∈ E⃗, where E⃗ is obtained by assigning directions to every undirected edge. In this case, the state x^t_{σ,i→j} represents the logarithm of the marginal probability that vertex i belongs to group σ when the information from vertex j is withheld, at the t-th iteration. With the choice of matrix and activation functions shown in Table 1 (EM + BP), Eq. (2) becomes exactly the update equation of the BP algorithm [13]. (Precisely speaking, this is the BP algorithm in which the stochastic block model (SBM) is assumed as the generative model; the SBM is explained below.) The matrix M = B = [B_{j→k, i→j}] is the so-called non-backtracking matrix [14], and the softmax function represents the normalization of the state x^t_{σ,i→j}.

The BP algorithm requires the model parameters W^t and b^t as inputs. For example, when the expectation-maximization (EM) algorithm is considered, the BP algorithm comprises one half (the E-step) of the algorithm. The parameter learning of the model is conducted in the other half (the M-step), which can be performed analytically using the current result of the BP algorithm. Here, W^t and b^t are the estimates of the so-called density matrix (or affinity matrix) and of the external field resulting from the messages from non-edges [13], respectively, and they are common for every t. Therefore, the differences between the EM algorithm and the GNN can be summarized as follows. While there is an analogy between the inference procedures, in the EM algorithm the parameter learning of the model is conducted analytically, at the expense of the restrictions imposed by the assumed statistical model. In GNNs, on the other hand, the learning is conducted numerically in a data-driven manner [15], for example by backpropagation. While we shed light on the detailed correspondence in the case of graph partitioning here, the relationship between GNN and BP is also discussed in [16, 17].
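To make the feedforward dynamics of Eq. (1) concrete, the following is a minimal NumPy sketch of an untrained GNN: random Gaussian weights W^t, φ = tanh, no bias, and a k-means step on the final state in place of a trained readout classifier. This is an illustration rather than the authors' implementation; the 1/√D weight scale, the column-wise normalization of the state, and the specific values of D and T are assumptions made here for numerical stability.

```python
import numpy as np

def untrained_gnn_labels(A, D=16, T=20, K=2, seed=0):
    """Run T random-weight layers of Eq. (1) on adjacency matrix A and cluster the final state."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    X = rng.uniform(-1.0, 1.0, size=(N, D))                  # random initial state X^0
    for _ in range(T):
        W = rng.normal(0.0, 1.0 / np.sqrt(D), size=(D, D))   # random W^t (variance 1/D: assumption)
        X = A @ np.tanh(X) @ W                                # Eq. (1) without the bias term
        X /= np.linalg.norm(X, axis=0, keepdims=True) + 1e-12 # rescale columns to avoid blow-up (illustrative)
    return kmeans(X, K, rng)

def kmeans(X, K, rng, n_iter=100):
    """Plain k-means on the rows of X, used here as the readout classifier."""
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

Given an adjacency matrix A (for example, one sampled from the SBM introduced below), `untrained_gnn_labels(A)` returns a two-group assignment that can be compared with the planted labels up to a permutation.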
We analyze the performance of an untrained GNN on the stochastic block model (SBM). This is a random graph model with a planted group structure, and it is commonly employed as the generative model. The graph has |V| = N vertices, and each vertex has a preassigned group label σ ∈ {1, ..., K}, i.e., V = ∪_σ V_σ. We define V_σ as the set of vertices in group σ, γ_σ ≡ |V_σ|/N, and σ_i represents the planted group assignment of vertex i. For each pair of vertices i ∈ V_σ and j ∈ V_{σ'}, an edge is generated with probability ρ_{σσ'}, which is an element of the density matrix. Throughout this paper, we assume that ρ_{σσ'} = O(N^{−1}), so that the resulting graph has a constant average degree, or in other words, the graph is sparse. We denote the average degree by c. Therefore, the adjacency matrix A = [A_{ij}] of the SBM is generated with probability

\[
p(A) = \prod_{i<j} \rho_{\sigma_i \sigma_j}^{A_{ij}} \bigl( 1 - \rho_{\sigma_i \sigma_j} \bigr)^{1 - A_{ij}}.
\]
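As an illustration of the generative process just described, the sketch below samples a sparse two-group SBM. Parameterizing the density matrix by the average degree c and the ratio eps = ρ_out/ρ_in, and taking equally sized groups, are choices made here for concreteness; the text above only requires ρ_{σσ'} = O(N^{−1}).

```python
import numpy as np

def sample_sbm(N=1000, c=8.0, eps=0.3, seed=0):
    """Sample a two-group SBM with average degree c and ratio eps = rho_out / rho_in."""
    rng = np.random.default_rng(seed)
    sizes = (N // 2, N - N // 2)
    labels = np.repeat([0, 1], sizes)                 # planted assignments sigma_i
    # Choose rho_in, rho_out so that the average degree equals c (equal group sizes assumed).
    rho_in = 2.0 * c / (N * (1.0 + eps))
    rho_out = eps * rho_in
    rho = np.array([[rho_in, rho_out], [rho_out, rho_in]])
    A = np.zeros((N, N), dtype=int)
    iu, ju = np.triu_indices(N, k=1)                  # each pair i<j gets an edge w.p. rho_{sigma_i sigma_j}
    edges = rng.random(len(iu)) < rho[labels[iu], labels[ju]]
    A[iu[edges], ju[edges]] = 1
    A += A.T                                          # symmetric, undirected adjacency matrix
    return A, labels
```

Combined with the GNN sketch above, `sample_sbm` followed by `untrained_gnn_labels` reproduces the type of experiment analyzed in this paper, with the overlap computed against the planted labels up to a label permutation.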
References

[1] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In International Conference on Knowledge Discovery and Data Mining, 2016.
[2] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
[3] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61–80, January 2009.
[4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
[5] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[6] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pages 1024–1034. Curran Associates, Inc., 2017.
[8] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2024–2033. PMLR, 2017.
[9] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, August 2007.
[10] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
[11] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
[12] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[13] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84:066106, 2011.
[14] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. U.S.A., 110(52):20935–40, December 2013.
[15] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. GraphRNN: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.
[16] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[17] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
[18] Tiago P. Peixoto. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E, 89:012804, January 2014.
[19] Tiago P. Peixoto. Bayesian stochastic blockmodeling. arXiv preprint arXiv:1705.10225, 2017.
[20] M. E. J. Newman and Gesine Reinert. Estimating the number of communities in a network. Phys. Rev. Lett., 117:078301, August 2016.
[21] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3):431–461, August 2015.
[22] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 694–703, New York, 2014. ACM.
[23] Cristopher Moore. The computer science and physics of community detection: landscapes, phase transitions, and hardness. arXiv preprint arXiv:1702.00467, 2017.
[24] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, pages 3360–3368. Curran Associates, Inc., 2016.
[25] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.
[26] H. Sompolinsky and Annette Zippelius. Relaxational dynamics of the Edwards-Anderson model and the mean-field theory of spin-glasses. Phys. Rev. B, 25:6860–6875, June 1982.
[27] A. Crisanti and H. Sompolinsky. Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Phys. Rev. A, 37:4865–4874, June 1988.
[28] A. Crisanti, H. J. Sommers, and H. Sompolinsky. Chaos in neural networks: chaotic solutions. Preprint, 1990.
[29] Manfred Opper, Burak Çakmak, and Ole Winther. A theory of solving TAP equations for Ising models with general invariant random matrices. Journal of Physics A: Mathematical and Theoretical, 49(11):114002, 2016.
[30] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[31] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
[32] Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[33] Tatsuro Kawamoto and Yoshiyuki Kabashima. Limitations in the spectral method for graph partitioning: Detectability threshold and localization of eigenvectors. Phys. Rev. E, 91:062803, June 2015.
[34] T. Kawamoto and Y. Kabashima. Detectability of the spectral method for sparse graph partitioning. Eur. Phys. Lett., 112(4):40007, 2015.
[35] Tatsuro Kawamoto. Algorithmic detectability threshold of the stochastic block model. Phys. Rev. E, 97:032301, March 2018.
[36] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems, 2015.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[40] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[41] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, pages 3844–3852. Curran Associates, Inc., 2016.
[42] Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang. Modularity based community detection with deep learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2252–2258. AAAI Press, 2016.
A Derivation of the self-consistent equation
In this section, the detailed derivation of the self-consistent equation for the covariance matrix is presented. Here, we recast our starting-point equation:

\[
1 = \int D\hat{x}^{t+1} D x^{t+1}\, e^{-L_1} \left\langle e^{L_2} \right\rangle_{A, W^t, X^t},
\qquad
\begin{cases}
L_1 = \sum_{\sigma\mu} \gamma_\sigma\, \hat{x}^{t+1}_{\sigma\mu} x^{t+1}_{\sigma\mu}, \\[2pt]
L_2 = \dfrac{1}{N} \sum_{\mu\nu} \sum_{ij} A_{ij}\, W^{t}_{\nu\mu}\, \hat{x}^{t+1}_{\sigma_i \mu}\, \phi(x^{t}_{j\nu}).
\end{cases}
\qquad (12)
\]

A.1 Random averages over W^t and A

We first take the average of exp(L_2) over W^t. The Gaussian integral with respect to W^t yields

\[
\left\langle e^{L_2} \right\rangle_{W^t}
= \exp\!\left[ \frac{1}{2DN^2} \sum_{\mu\nu} \sum_{ijk\ell} A_{ij}\, \hat{x}^{t+1}_{\sigma_i\mu} \phi(x^{t}_{j\nu})\, A_{k\ell}\, \hat{x}^{t+1}_{\sigma_k\mu} \phi(x^{t}_{\ell\nu}) \right]
= \exp\!\left( \frac{1}{2D} \sum_{\sigma_1\sigma_2\sigma_3\sigma_4} \sum_{\mu} \hat{x}^{t+1}_{\sigma_1\mu} \hat{x}^{t+1}_{\sigma_3\mu} \sum_{\nu} \psi^{t,\nu}_{\sigma_1\sigma_2} \psi^{t,\nu}_{\sigma_3\sigma_4} \right)
= \exp\!\left( \frac{D}{2} \sum_{\sigma_1\sigma_2} u^{t+1}_{\sigma_1\sigma_2} v^{t}_{\sigma_1\sigma_2} \right), \qquad (13)
\]

where we introduce the following quantities:

\[
\psi^{t,\nu}_{\sigma\sigma'} \equiv \frac{1}{N} \sum_{i \in V_\sigma} \sum_{j \in V_{\sigma'}} A_{ij}\, \phi(x^{t}_{j\nu}), \qquad (14)
\]
\[
u^{t+1}_{\sigma\sigma'} \equiv \frac{1}{D} \sum_{\mu} \hat{x}^{t+1}_{\sigma\mu} \hat{x}^{t+1}_{\sigma'\mu}, \qquad (15)
\]
\[
v^{t}_{\sigma\sigma'} \equiv \frac{1}{D} \sum_{\mu} \sum_{\tilde{\sigma}\tilde{\sigma}'} \psi^{t,\mu}_{\sigma\tilde{\sigma}}\, \psi^{t,\mu}_{\sigma'\tilde{\sigma}'}. \qquad (16)
\]

Therefore, Eq. (12) can be written as

\[
1 = \int D\hat{x}^{t+1} D x^{t+1} \Biggl\langle \int D\hat{u}^{t+1} D u^{t+1} \int D\hat{v}^{t} D v^{t} \int D\hat{\psi}^{t} D\psi^{t}
\Biggl\langle \exp\Biggl[ -L_1 + \frac{D}{2} \sum_{\sigma\sigma'} u^{t+1}_{\sigma\sigma'} v^{t}_{\sigma\sigma'}
- \sum_{\sigma\sigma'} \hat{u}^{t+1}_{\sigma\sigma'} \Bigl( D u^{t+1}_{\sigma\sigma'} - \sum_{\mu} \hat{x}^{t+1}_{\sigma\mu} \hat{x}^{t+1}_{\sigma'\mu} \Bigr)
- \sum_{\sigma\sigma'} \hat{v}^{t}_{\sigma\sigma'} \Bigl( D v^{t}_{\sigma\sigma'} - \sum_{\nu} \sum_{\tilde{\sigma}\tilde{\sigma}'} \psi^{t,\nu}_{\sigma\tilde{\sigma}} \psi^{t,\nu}_{\sigma'\tilde{\sigma}'} \Bigr)
- \sum_{\sigma\sigma'} \sum_{\mu} \hat{\psi}^{t,\mu}_{\sigma\sigma'} \Bigl( \psi^{t,\mu}_{\sigma\sigma'} - \frac{1}{N} \sum_{i \in V_\sigma} \sum_{j \in V_{\sigma'}} A_{ij}\, \phi(x^{t}_{j\mu}) \Bigr) \Biggr] \Biggr\rangle_A \Biggr\rangle_{X^t}. \qquad (17)
\]

Note that, as we will see below, u^{t+1}, v^t, ψ^t, and their conjugates are related to X^t, and thus the average over X^t is taken outside of their integrals.

We next take the average over a random graph. In Eq. (17), only the final term in the exponent is relevant to A. We denote this term as L_3. We also let

\[
\Xi^{t}_{ij} = \frac{1}{N} \sum_{\mu} \Bigl( \hat{\psi}^{t,\mu}_{\sigma_i\sigma_j}\, \phi(x^{t}_{j\mu}) + \hat{\psi}^{t,\mu}_{\sigma_j\sigma_i}\, \phi(x^{t}_{i\mu}) \Bigr).
\]

Because the graph is generated from the SBM, we have that

\[
\left\langle e^{L_3} \right\rangle_A = \prod_{i<j} \cdots
\]

Here, we compare Eq. (24) with a Markovian discrete-time stochastic process y^{t+1}_σ = η^t_σ, in which each element is correlated via a random noise, i.e., ⟨η^t_σ⟩_η = 0 and ⟨η^t_σ η^t_{σ'}⟩_η = C_{σσ'} for any t. The corresponding normalization condition reads

\[
1 = \int \prod_{\sigma} dy^{t+1}_{\sigma} \Biggl\langle \prod_{\sigma} \delta\bigl( y^{t+1}_{\sigma} - \eta^{t}_{\sigma} \bigr) \Biggr\rangle_{\eta}
= \int \prod_{\sigma} \frac{\gamma_\sigma\, d\hat{y}^{t+1}_{\sigma}\, dy^{t+1}_{\sigma}}{2\pi i} \Bigl\langle e^{-\sum_\sigma \gamma_\sigma \hat{y}^{t+1}_{\sigma} ( y^{t+1}_{\sigma} - \eta^{t}_{\sigma} )} \Bigr\rangle_{\eta}
\]
\[
= \int D\hat{y}^{t+1} D y^{t+1}\, e^{-\sum_\sigma \gamma_\sigma \hat{y}^{t+1}_{\sigma} y^{t+1}_{\sigma}} \int \prod_{\sigma} \frac{d\eta^{t}_{\sigma}}{\sqrt{\det(2\pi C)}} \exp\Biggl( -\frac{1}{2} \sum_{\sigma\sigma'} \eta^{t}_{\sigma} C^{-1}_{\sigma\sigma'} \eta^{t}_{\sigma'} + \sum_{\sigma} \gamma_\sigma \hat{y}^{t+1}_{\sigma} \eta^{t}_{\sigma} \Biggr)
\]
\[
= \int D\hat{y}^{t+1} D y^{t+1} \exp\Biggl( -\sum_{\sigma} \gamma_\sigma \hat{y}^{t+1}_{\sigma} y^{t+1}_{\sigma} + \frac{1}{2} \sum_{\sigma\sigma'} \hat{y}^{t+1}_{\sigma} \gamma_\sigma C_{\sigma\sigma'} \gamma_{\sigma'} \hat{y}^{t+1}_{\sigma'} \Biggr). \qquad (26)
\]

Analogously to the case of the GNN, we have defined D\hat{y}^{t+1} D y^{t+1} ≡ ∏_σ γ_σ dŷ^{t+1}_σ dy^{t+1}_σ / (2πi).

A.3 Self-consistent equation

Finally, we compare Eqs. (24) and (26). However, note that these are not of exactly the same form, because the average over X^t is taken outside of the exponential in Eq. (24).
Two approximations are made in order to derive the self-consistent equation, and the assumptions that justify these approximations are discussed afterward.

First, if the approximation

\[
\Biggl\langle \exp\Biggl( \sum_{\sigma\sigma'} \hat{x}^{t+1}_{\sigma} F_{\sigma\sigma'}\bigl( X^{t} \bigr) \hat{x}^{t+1}_{\sigma'} \Biggr) \Biggr\rangle_{X^t}
\approx \exp\Biggl( \sum_{\sigma\sigma'} \hat{x}^{t+1}_{\sigma} \bigl\langle F_{\sigma\sigma'}\bigl( X^{t} \bigr) \bigr\rangle_{X^t} \hat{x}^{t+1}_{\sigma'} \Biggr) \qquad (27)
\]

holds in the stationary limit, then the group-wise state x^t can be regarded as a Gaussian variable whose correlation matrix obeys

\[
C_{\sigma\sigma'} = \frac{1}{\gamma_\sigma \gamma_{\sigma'}} \sum_{\tilde{\sigma}\tilde{\sigma}'} B_{\sigma\tilde{\sigma}} B_{\sigma'\tilde{\sigma}'}
\Biggl\langle \frac{1}{\gamma_{\tilde{\sigma}} N}\, \frac{1}{\gamma_{\tilde{\sigma}'} N} \sum_{i \in V_{\tilde{\sigma}}} \sum_{j \in V_{\tilde{\sigma}'}} \phi(x_i)\, \phi(x_j) \Biggr\rangle_{X^t}. \qquad (28)
\]

This equation is still not closed, because the right-hand side of Eq. (28) depends on the statistics of X^t rather than x^t. However, because the vertices within a group σ are statistically equivalent, {x_i}_{i∈V_σ} are expected to obey the same distribution with mean x_σ, which itself is a random variable. If Σ_{i∈V_σ} φ(x_i)/(γ_σ N) ≈ φ(x_σ) holds, then the right-hand side of Eq. (28) can be evaluated as the average with respect to the group-wise variable x^t. Then, within this regime we arrive at the following self-consistent equation with respect to the covariance matrix C = [C_{σσ'}]:

\[
C_{\sigma\sigma'} = \frac{1}{\gamma_\sigma \gamma_{\sigma'}} \sum_{\tilde{\sigma}\tilde{\sigma}'} B_{\sigma\tilde{\sigma}} B_{\sigma'\tilde{\sigma}'}
\int d\mathbf{x}\, \frac{ e^{-\frac{1}{2} \mathbf{x}^{\top} C^{-1} \mathbf{x}} }{ \sqrt{ (2\pi)^{K} \det C } }\, \phi(x_{\tilde{\sigma}})\, \phi(x_{\tilde{\sigma}'}). \qquad (29)
\]

(An illustrative numerical iteration of this equation is sketched at the end of this supplementary material.)

Let us consider the first approximation, which we adopted in Eq. (27). In the terminology of physics, this is the replacement of a free energy with an internal energy, or the neglect of the entropic contribution. It is difficult to evaluate this residual in general. However, note that the approximation becomes closer to an equality as every x_i approaches the same value. This implies that the self-consistent equation is more accurate as we approach the detectability limit, and yields an adequate estimate of the critical value.

Let us next consider the second approximation, adopted in Eq. (28). Although the law of large numbers with respect to φ(x_i) (not x_i) ensures that Σ_{i∈V_σ} φ(x_i)/(γ_σ N) converges to a value characterized by the group, this value may differ from φ(x_σ). In fact, the relation between the two is in general an inequality (Jensen's inequality) when the activation function φ is convex. The (exact) equality holds only when {x_i} is constant or the function φ is linear within the considered domain.

The second approximation can be justified in the following cases. The first case is when the fluctuation of x_i − x_{σ_i} is negligible compared with the magnitude of x_σ. Note that this is the same assumption as we made in the first approximation. To see this precisely, let us express x_i as x_i = x_σ + z_i for i ∈ V_σ. We can formally write the probability distribution P({x_i}) of {x_i} in a hierarchical fashion as follows:

\[
P(\{x_i\}) = \int \prod_{\sigma} d x_{\sigma} \int \prod_{i} d z_i\, P_{\sigma}(\{x_{\sigma}\})\, P(\{x_i\}) \prod_{i} \delta\bigl( z_i - x_i + x_{\sigma_i} \bigr), \qquad (30)
\]

where P_σ({x_σ}) is the probability distribution with respect to x_σ. Thus, the expectation ⟨f(X)⟩_X can be expressed as

\[
\langle f(X) \rangle_X \equiv \int \prod_i d x_i\, P(\{x_i\})\, f(\{x_i\})
= \int \prod_{\sigma} d x_{\sigma} \int \prod_i d z_i\, P(\{z_i\} \mid \{x_{\sigma}\})\, P_{\sigma}(\{x_{\sigma}\})\, f(\{x_{\sigma_i} + z_i\}), \qquad (31)
\]

where P({z_i}|{x_σ}) ≡ P({x_{σ_i} + z_i}), which can be a nontrivial function. However, whenever the contributions from the average with respect to z_i are negligible, Eq. (31) implies that the expectation in Eq. (28) can be evaluated using only the group-wise variables {x_σ}. Another case is when the activation function φ is almost linear within the domain over which z_i fluctuates. For example, in the case that φ = tanh, the present approximation does not deteriorate the accuracy even when x_σ ≈ 0. When either of these assumptions holds, the equality in Jensen's inequality is approximately satisfied, and our derivation of the self-consistent equation is justified.

B K-means classification using φ(X)

Instead of X^T, φ(X^T) can be adopted to perform the k-means classification after the feedforward process. Again, we employ tanh as the nonlinear activation function. The results of an untrained GNN and a trained GNN under the same experimental settings as in the main text are illustrated in Fig. 5 and Fig. 6, respectively. In Fig. 5a, the reader should note that the range of the color gradient is different from that in the phase diagram in the main text. For the untrained GNN, the obtained overlaps are clearly better than those using X^T. This can be understood as follows: the error is reduced because the nonlinear function drives each element of the state X^T to either +1 or −1, making the classification using the k-means method easier and more accurate. On the other hand, for the trained GNN, differences between the overlaps using X^T and φ(X^T) are hardly observable.

Particularly for the case of an untrained GNN in which φ(X^T) is adopted for the readout classifier, the overlap gradually changes around the estimated detectability limit. This may be a result of the strong finite-size effect. Again, note that our estimate of the detectability limit is for the case that N → ∞.

Figure 5: Performance of the untrained GNN using the k-means classifier with φ(X). (a) The detectability phase diagram and (b) the overlaps of the SBM with c = 8 are plotted in the same manner as in Fig. 3 in the main text. When the variation of the overlap is interpolated for each graph size, these curves cross at a common value of ε*, implying the presence of a detectability phase transition around the value of ε predicted by our mean-field estimate.

Figure 6: Performance of the trained GNN using the classifier with φ(X).
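For completeness, the self-consistent equation (29) of Appendix A can be iterated numerically by a damped fixed-point scheme. The sketch below is illustrative only: the matrix B and the group fractions γ_σ entering Eq. (29) are defined in the main text and are therefore treated here as inputs, the Gaussian expectation is estimated by Monte Carlo sampling, and the damping factor and sample size are arbitrary choices.

```python
import numpy as np

def iterate_covariance(B, gamma, phi=np.tanh, n_iter=200, n_samples=200_000, seed=0):
    """Damped fixed-point iteration of Eq. (29): C = diag(1/gamma) B E B^T diag(1/gamma),
    where E[s, s'] = <phi(x_s) phi(x_s')> under x ~ N(0, C), estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    K = len(gamma)
    C = np.eye(K)                                  # arbitrary positive-definite initialization
    for _ in range(n_iter):
        x = rng.multivariate_normal(np.zeros(K), C, size=n_samples)
        E = phi(x).T @ phi(x) / n_samples          # Monte Carlo estimate of <phi(x_s) phi(x_s')>
        C_new = (B @ E @ B.T) / np.outer(gamma, gamma)
        C = 0.5 * C + 0.5 * C_new                  # damping for numerical stability
    return C
```

The behavior of the resulting fixed point is what underlies the mean-field estimate of the detectability limit discussed in Appendix A; as noted there, the self-consistent equation is expected to be most accurate close to that limit.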