On the equivalence between graph isomorphism testing and function approximation with GNNs

Abstract

Graph neural networks (GNNs) have achieved considerable success on graph-structured data. In light of this, there has been increasing interest in studying their representation power. One line of work focuses on the universal approximation of permutation-invariant functions by certain classes of GNNs, and another demonstrates the limitations of GNNs via graph isomorphism tests. Our work connects these two perspectives and proves their equivalence. We further develop a framework for the representation power of GNNs in the language of sigma-algebras, which incorporates both viewpoints. Using this framework, we compare the expressive power of different classes of GNNs as well as other methods on graphs. In particular, we prove that order-2 Graph G-invariant networks fail to distinguish non-isomorphic regular graphs with the same degree. We then extend them to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and provides improvements on real-world social network datasets.



Zhengdao Chen

Courant Institute of Mathematical Sciences, New York University

Soledad Villar

Courant Institute of Mathematical Sciences, Center for Data Science, New York University

Lei Chen

Courant Institute of Mathematical Sciences, New York University

Joan Bruna

Courant Institute of Mathematical Sciences, Center for Data Science, New York University


Preprint. Under review. arXiv [cs.LG].

Graph-structured data naturally occur in many areas of knowledge, including computational biology, chemistry and social sciences. Graph neural networks, in all their forms, yield useful representations of graph data partly because they take into consideration the intrinsic symmetries of graphs, such as invariance and equivariance with respect to a relabeling of the nodes [25, 7, 14, 8, 10, 26, 3].

All these different architectures are proposed with different purposes (see [29] for a survey and references therein), and a priori it is not obvious how to compare their power. The recent work [30] proposes to study the representation power of GNNs via their performance on graph isomorphism tests. They developed the Graph Isomorphism Networks (GINs) that are as powerful as the one-dimensional Weisfeiler-Lehman (1-WL or just WL) test for graph isomorphism [28], and showed that no other neighborhood-aggregating (or message passing) GNN can be more powerful than the 1-WL test. Variants of message passing GNNs include [25, 9].

On the other hand, for feed-forward neural networks, many results have been obtained regarding their ability to approximate continuous functions, commonly known as the universal approximation theorems, such as the seminal works of [6, 12]. Following this line of work, it is natural to study the expressivity of graph neural networks in terms of function approximation. Since we could argue that many if not most functions on a graph that we are interested in are invariant or equivariant to permutations of the nodes in the graph, GNNs are usually designed to be invariant or equivariant, and therefore the natural question is whether certain classes of GNNs can approximate any continuous and invariant or equivariant function. Recent work [18] showed the universal approximation of $G$-invariant networks, constructed based on the linear invariant and equivariant layers studied in [17], if the order of the tensor involved in the networks can grow as the graph gets larger. Such a dependence on the graph size was theoretically overcome by the very recent work [13], though there is no known upper bound on the order of the tensors involved. With potentially very-high-order tensors, these models that are guaranteed to achieve universal approximation are not quite feasible in practice.

The foundational part of this work aims at building the bridge between graph isomorphism testing and invariant function approximation, the two main perspectives for studying the expressive power of graph neural networks. We demonstrate an equivalence between the ability of a class of GNNs to distinguish between any pair of non-isomorphic graphs and its power of approximating any (continuous) invariant function, for both the case with finite feature space and the case with continuous feature space. Furthermore, we argue that the concept of sigma-algebras on the space of graphs is a natural description of the power of graph neural networks, allowing us to build a taxonomy of GNNs based on how their respective sigma-algebras interact.
Building on this theoretical framework, we identify an opportunity to increase the expressive power of order-2 $G$-invariant networks with computational tractability, by considering a ring of invariant matrices under addition and multiplication. We show that the resulting model, which we refer to as Ring-GNN, is able to distinguish between non-isomorphic regular graphs where order-2 $G$-invariant networks provably fail. We illustrate these gains numerically in synthetic and real graph classification tasks.

Summary of main contributions:

• We show the equivalence between graph isomorphism testing and approximation of permutation-invariant functions for analyzing the expressive power of graph neural networks.
• We introduce a language of sigma-algebra for studying the representation power of graph neural networks, which unifies both graph isomorphism testing and function approximation, and use this framework to compare the power of some GNNs and other methods.
• We propose Ring-GNN, a tractable extension of order-2 Graph $G$-invariant Networks that uses the ring of matrix addition and multiplication. We show this extension is necessary and sufficient to distinguish Circular Skip Links graphs.

Graph Neural Networks and graph isomorphism.

Graph isomorphism is a fundamental problem in theoretical computer science. It amounts to deciding, given two graphs $A, B$, whether there exists a permutation $\pi$ such that $\pi A = B \pi$. There exists no known polynomial-time algorithm to solve it, but recently Babai made a breakthrough by showing that it can be solved in quasi-polynomial time [1]. Recently [30] introduced graph isomorphism tests as a characterization of the power of graph neural networks. They show that if a GNN follows a neighborhood aggregation scheme, then it cannot distinguish pairs of non-isomorphic graphs that the 1-WL test fails to distinguish. Therefore this class of GNNs is at most as powerful as the 1-WL test. They further propose the Graph Isomorphism Networks (GINs) based on approximating injective set functions by multi-layer perceptrons (MLPs), which can be as powerful as the 1-WL test. Based on $k$-WL tests [4], [19] proposes $k$-GNN, which can take higher-order interactions among nodes into account. Concurrently to this work, [16] proves that order-$k$ invariant graph networks are at least as powerful as the $k$-WL tests and, similarly to us, augments order-2 networks with matrix multiplication. They show that the augmented networks achieve at least the power of the 3-WL test. [20] proposes relational pooling (RP), an approach that combines permutation-sensitive functions under all permutations to obtain a permutation-invariant function. If RP is combined with permutation-sensitive functions that are sufficiently expressive, then it can be shown to be a universal approximator. A combination of RP and GINs is able to distinguish certain non-isomorphic regular graphs which GIN alone would fail on. A drawback of RP is that its full version is intractable computationally, and therefore it needs to be approximated by averaging over randomly sampled permutations, in which case the resulting function is not guaranteed to be permutation-invariant.

Universal approximation of functions with symmetry. Many works have discussed the function approximation capabilities of neural networks that satisfy certain symmetries.
[2] studies the symmetry in neural networks from the perspective of probabilistic symmetry and characterizes the deterministic and stochastic neural networks that satisfy certain symmetry. [24] shows that equivariance of a neural network corresponds to symmetries in its parameter-sharing scheme. [31] proposes a neural network architecture with polynomial layers that is able to achieve universal approximation of invariant or equivariant functions. [17] studies the spaces of all invariant and equivariant linear functions, and obtains bases for such spaces. Building upon this work, [18] proposes the $G$-invariant network for a symmetry group $G$, which achieves universal approximation of $G$-invariant functions if the maximal tensor order involved in the network is allowed to grow as $n(n-1)/2$, but such high-order tensors are prohibitive in practice. Upper bounds on the approximation power of the $G$-invariant networks when the tensor order is limited remain open except for when $G = A_n$ [18]. The very recent work [13] extends the result to the equivariant case, although it suffers from the same problem of possibly requiring high-order tensors. Within the computer vision literature, this problem has also been addressed; in particular, [11] proposes an architecture that can potentially express all equivariant functions.

To the best of our knowledge, this is the first work that shows an explicit connection between the two aforementioned perspectives of studying the representation power of graph neural networks: graph isomorphism testing and universal approximation. Our main theoretical contribution lies in showing an equivalence between them, for both finite and continuous feature space cases, with a natural generalization of the notion of graph isomorphism testing to the latter case. Then we focus on the Graph $G$-invariant network based on [17, 18], and show that when the maximum tensor order is restricted to be 2, it cannot distinguish between non-isomorphic regular graphs with equal degrees.
As a corollary, such networks are not universal. Note that our result shows an upper bound on order-2 $G$-invariant networks, whereas concurrently to us, [16] provides a lower bound by relating to $k$-WL tests. Concurrently to [16], we propose a modified version of order-2 graph networks to capture higher-order interactions among nodes without computing tensors of higher order.

In this section we show that there exists a very close connection between the universal approximation of permutation-invariant functions by a class of functions, and its ability to perform graph isomorphism tests. We consider graphs with nodes and edges labeled by elements of a compact set $\mathcal{X} \subset \mathbb{R}$. We represent graphs with $n$ nodes by an $n \times n$ matrix $G \in \mathcal{X}^{n \times n}$, where a diagonal term $G_{ii}$ represents the label of the $i$th node, and a non-diagonal $G_{ij}$ represents the label of the edge from the $i$th node to the $j$th node. An undirected graph will then be represented by a symmetric $G$.

Thus, we focus on analyzing a collection $\mathcal{C}$ of functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$. We are especially interested in collections of permutation-invariant functions, defined so that $f(\pi^\top G \pi) = f(G)$ for all $G \in \mathcal{X}^{n \times n}$ and all $\pi \in S_n$, where $S_n$ is the permutation group of $n$ elements. For classes of functions, we define the property of being able to discriminate non-isomorphic graphs, which we call GIso-discriminating, and which as we will see generalizes naturally to the continuous case.
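The invariance condition $f(\pi^\top G \pi) = f(G)$ can be checked by brute force on small graphs. The following is a minimal sketch, not taken from the paper: the helper names (`permute`, `is_permutation_invariant`) and the three example functions are illustrative assumptions, and the check enumerates all $n!$ relabelings, so it is only practical for very small $n$.

```python
import itertools

def permute(G, perm):
    """Relabel nodes: entry (i, j) of the result is G[perm[i]][perm[j]].

    This is conjugation of G by the permutation matrix of `perm`.
    """
    n = len(G)
    return [[G[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def total_weight(G):
    """Sum of all entries: unchanged by any relabeling."""
    return sum(sum(row) for row in G)

def sorted_node_labels(G):
    """Multiset of diagonal (node) labels as a sorted tuple: also invariant."""
    return tuple(sorted(G[i][i] for i in range(len(G))))

def first_edge_label(G):
    """Label of the edge from node 0 to node 1: depends on the labeling."""
    return G[0][1]

def is_permutation_invariant(f, G):
    """Check f(pi^T G pi) == f(G) for every permutation pi (on this one G)."""
    n = len(G)
    return all(f(permute(G, p)) == f(G)
               for p in itertools.permutations(range(n)))

# A path on 3 nodes, with node labels 1, 2, 3 stored on the diagonal.
G = [[1, 1, 0],
     [1, 2, 1],
     [0, 1, 3]]

print(is_permutation_invariant(total_weight, G))        # True
print(is_permutation_invariant(sorted_node_labels, G))  # True
print(is_permutation_invariant(first_edge_label, G))    # False
```

Passing the check on one graph does not prove invariance in general; the sketch is only meant to make the definition concrete.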

Definition 1.

Let $\mathcal{C}$ be a collection of permutation-invariant functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$. We say $\mathcal{C}$ is GIso-discriminating if for all non-isomorphic $G_1, G_2 \in \mathcal{X}^{n \times n}$ (denoted $G_1 \not\simeq G_2$), there exists a function $h \in \mathcal{C}$ such that $h(G_1) \neq h(G_2)$. This definition is illustrated by Figure 2 in the appendix.

Definition 2.

Let $\mathcal{C}$ be a collection of permutation-invariant functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$. We say $\mathcal{C}$ is universally approximating if for all permutation-invariant functions $f$ from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$, and for all $\epsilon > 0$, there exists $h_{f,\epsilon} \in \mathcal{C}$ such that $\| f - h_{f,\epsilon} \|_\infty := \sup_{G \in \mathcal{X}^{n \times n}} | f(G) - h_{f,\epsilon}(G) | < \epsilon$.

As a warm-up we first consider the space of graphs with a finite set of possible features for nodes and edges, $\mathcal{X} = \{1, \dots, M\}$.

Theorem 1.

Universally approximating classes of functions are also GIso-discriminating.

Proof.

Given $G_1, G_2 \in \mathcal{X}^{n \times n}$ with $G_1 \not\simeq G_2$, we consider the permutation-invariant function $\mathbb{1}_{\simeq G_1} : \mathcal{X}^{n \times n} \to \mathbb{R}$ such that $\mathbb{1}_{\simeq G_1}(G) = 1$ if $G$ is isomorphic to $G_1$ and 0 otherwise. Therefore, it can be approximated with $\epsilon = 0.1$ by a function $h \in \mathcal{C}$. Since $h(G_1) > 0.9$ and $h(G_2) < 0.1$, $h$ is a function that distinguishes $G_1$ from $G_2$, as in Definition 1. Hence $\mathcal{C}$ is GIso-discriminating.

To obtain a result in the reverse direction, we first introduce the concept of an augmented collection of functions, which is especially natural when $\mathcal{C}$ is a collection of neural networks.

Definition 3.

Given $\mathcal{C}$, a collection of functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$, we consider an augmented collection of functions also from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$ consisting of functions that map an input graph $G$ to $\mathcal{NN}([h_1(G), \dots, h_d(G)])$ for some finite $d$, where $\mathcal{NN}$ is a feed-forward neural network / multi-layer perceptron, and $h_1, \dots, h_d \in \mathcal{C}$. When $\mathcal{NN}$ is restricted to have $L$ layers, we denote this augmented collection by $\mathcal{C}^{+L}$. In this work, we consider ReLU as the nonlinear activation function in the neural networks.

Remark 1. If $\mathcal{C}_{L_0}$ is the collection of feed-forward neural networks with $L_0$ layers, then $\mathcal{C}_{L_0}^{+L}$ represents the collection of feed-forward neural networks with $L_0 + L$ layers.

Remark 2. If $\mathcal{C}$ is a collection of permutation-invariant functions, so is $\mathcal{C}^{+L}$.

Theorem 2. If $\mathcal{C}$ is GIso-discriminating, then $\mathcal{C}^{+2}$ is universally approximating.

The proof is simple and it is a consequence of the following lemmas that we prove in Appendix A.

Lemma 1. If $\mathcal{C}$ is GIso-discriminating, then for all $G \in \mathcal{X}^{n \times n}$, there exists a function $\tilde{h}_G \in \mathcal{C}^{+1}$ such that for all $G'$, $\tilde{h}_G(G') = 0$ if and only if $G \simeq G'$.

Lemma 2.

Let $\mathcal{C}$ be a class of permutation-invariant functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$ satisfying the consequences of Lemma 1. Then $\mathcal{C}^{+1}$ is universally approximating.

Graph isomorphism is an inherently discrete problem, whereas universal approximation is usually more interesting when the input space is continuous. With our Definition 1 of GIso-discriminating, we can achieve a natural generalization of the above results to the scenario of a continuous input space. All proofs for this section can be found in Appendix A.

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}$, and consider graphs with $n$ nodes represented by $G \in K = \mathcal{X}^{n \times n}$; that is, the node features are $\{G_{ii}\}_{i=1,\dots,n}$ and the edge features are $\{G_{ij}\}_{i,j=1,\dots,n;\, i \neq j}$.

Theorem 3. If $\mathcal{C}$ is universally approximating, then it is also GIso-discriminating.

The essence of the proof is similar to that of Theorem 1. The other direction, showing that pairwise discrimination can lead to universal approximation, is less straightforward. As an intermediate step, we make the following definition:

Definition 4.

Let $\mathcal{C}$ be a class of functions $K \to \mathbb{R}$. We say it is able to locate every isomorphism class if for all $G \in K$ and for all $\epsilon > 0$ there exists $h_G \in \mathcal{C}$ such that:

• for all $G' \in K$, $h_G(G') \geq 0$;
• for all $G' \in K$, if $G' \simeq G$, then $h_G(G') = 0$; and
• there exists $\delta_G > 0$ such that if $h_G(G') < \delta_G$, then $\exists \pi \in S_n$ such that $d(\pi(G'), G) < \epsilon$, where $d$ is the Euclidean distance defined on $\mathbb{R}^{n \times n}$.

Lemma 3. If $\mathcal{C}$, a collection of continuous permutation-invariant functions from $K$ to $\mathbb{R}$, is GIso-discriminating, then $\mathcal{C}^{+1}$ is able to locate every isomorphism class.

Heuristically, we can think of the $h_G$ in the definition above as a "loss function" that penalizes the deviation of $G'$ from the equivalence class of $G$. In particular, the third condition says that if the loss value is small enough, then we know that $G'$ has to be close to the equivalence class of $G$.

Lemma 4.

Let $\mathcal{C}$ be a class of permutation-invariant functions $K \to \mathbb{R}$. If $\mathcal{C}$ is able to locate every isomorphism class, then $\mathcal{C}^{+2}$ is universally approximating.

Combining the two lemmas above, we arrive at the following theorem:

Theorem 4. If $\mathcal{C}$, a collection of continuous permutation-invariant functions from $K$ to $\mathbb{R}$, is GIso-discriminating, then $\mathcal{C}^{+3}$ is universally approximating.

A framework of representation power based on sigma-algebra

Let $K = \mathcal{X}^{n \times n}$ be a finite input space. Let $Q_K := K / \simeq$ be the set of isomorphism classes under the equivalence relation of graph isomorphism. That is, for all $\tau \in Q_K$, $\tau = \{ \pi^\top G \pi : \pi \in S_n \}$ for some $G \in K$.

Intuitively, a maximally expressive collection of permutation-invariant functions, $\mathcal{C}$, will allow us to know exactly which isomorphism class $\tau$ a given graph $G$ belongs to, by looking at the outputs of certain functions in the collection applied to $G$. Heuristically, we can consider each function in $\mathcal{C}$ as a "measurement", which partitions the graph space $K$ according to the function value at each point. If $\mathcal{C}$ is powerful enough, then as a collection it will partition $K$ to be as fine as $Q_K$. If not, it is going to be coarser than $Q_K$. These intuitions motivate us to introduce the language of sigma-algebra.

Recall that an algebra on a set $K$ is a collection of subsets of $K$ that includes $K$ itself, is closed under complement, and is closed under finite union. Because $K$ is finite, an algebra on $K$ is also a sigma-algebra on $K$, where a sigma-algebra further satisfies the condition of being closed under countable unions. Since $Q_K$ is a set of (non-intersecting) subsets of $K$, we can obtain the algebra generated by $Q_K$, defined as the smallest algebra that contains $Q_K$, and use $\sigma(Q_K)$ to denote the algebra (and sigma-algebra) generated by $Q_K$.

Observation 1. If $f : \mathcal{X}^{n \times n} \to \mathbb{R}$ is a permutation-invariant function, then $f$ is measurable with respect to $\sigma(Q_K)$, and we denote this by $f \in \mathcal{M}[\sigma(Q_K)]$.

Now consider a class of functions $\mathcal{C}$ that is permutation-invariant. Then for all $f \in \mathcal{C}$, $f \in \mathcal{M}[\sigma(Q_K)]$. We define the sigma-algebra generated by $f$ as the set of all the pre-images of Borel sets on $\mathbb{R}$ under $f$, and denote it by $\sigma(f)$. It is the smallest sigma-algebra on $K$ that makes $f$ measurable. For a class of functions $\mathcal{C}$, $\sigma(\mathcal{C})$ is defined as the smallest sigma-algebra on $K$ that makes all functions in $\mathcal{C}$ measurable.
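On a finite graph space, the atoms of $\sigma(\mathcal{C})$ are the level sets of the joint map $G \mapsto (h(G))_{h \in \mathcal{C}}$, so comparing $\sigma(\mathcal{C})$ with $\sigma(Q_K)$ amounts to comparing two partitions. The sketch below illustrates this under assumptions that are not in the paper: it restricts $K$ to simple undirected graphs on $N = 4$ nodes (binary edge labels, zero diagonal), labels isomorphism classes by a brute-force canonical form, and uses two hand-picked invariant functions (`num_edges`, `degree_sequence`).

```python
import itertools

N = 4  # number of nodes; an illustrative choice, small enough for brute force
PAIRS = list(itertools.combinations(range(N), 2))

def all_graphs():
    """Yield every simple undirected graph on N labeled nodes as an adjacency matrix."""
    for bits in itertools.product([0, 1], repeat=len(PAIRS)):
        A = [[0] * N for _ in range(N)]
        for (i, j), b in zip(PAIRS, bits):
            A[i][j] = A[j][i] = b
        yield A

def canonical_form(A):
    """Brute-force label of the isomorphism class: the smallest relabeled adjacency tuple."""
    return min(
        tuple(A[p[i]][p[j]] for i in range(N) for j in range(N))
        for p in itertools.permutations(range(N))
    )

def num_edges(A):
    """Permutation-invariant: the number of edges."""
    return sum(A[i][j] for i, j in PAIRS)

def degree_sequence(A):
    """Permutation-invariant: the sorted degree sequence."""
    return tuple(sorted(sum(row) for row in A))

def atoms(functions):
    """Partition the graph space by the joint values of the given functions.

    On a finite space these level sets are the atoms of the generated sigma-algebra.
    """
    cells = {}
    for A in all_graphs():
        key = tuple(f(A) for f in functions)
        cells.setdefault(key, []).append(A)
    return cells

iso_classes = atoms([canonical_form])    # the partition Q_K
edge_atoms = atoms([num_edges])          # strictly coarser than Q_K
degree_atoms = atoms([degree_sequence])  # as fine as Q_K when N = 4

print(len(iso_classes), len(edge_atoms), len(degree_atoms))  # 11 7 11
```

Because both functions are permutation-invariant, each of their atoms is a union of isomorphism classes, so an equal number of atoms means the partitions coincide. The degree sequence is therefore GIso-discriminating on this tiny space, while the edge count is not. This does not persist for larger graphs: regular graphs of equal degree share a degree sequence.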
Because here we assume $K$ is finite, it does not matter whether $\mathcal{C}$ is a countable collection.

We restrict our attention to the case of finite feature space. Given a graph $G \in \mathcal{X}^{n \times n}$, we use $E(G)$ to denote its isomorphism class, $\{ G' \in \mathcal{X}^{n \times n} : G' \simeq G \}$. The following results are proven in Appendix B.

Theorem 5. If $\mathcal{C}$ is a class of permutation-invariant functions on $\mathcal{X}^{n \times n}$ and $\mathcal{C}$ is GIso-discriminating, then $\sigma(\mathcal{C}) = \sigma(Q_K)$.

Together with Theorem 1, the following is an immediate consequence:

Corollary 1. If $\mathcal{C}$ is a class of permutation-invariant functions on $\mathcal{X}^{n \times n}$ and $\mathcal{C}$ achieves universal approximation, then $\sigma(\mathcal{C}) = \sigma(Q_K)$.

Theorem 6.

Let $\mathcal{C}$ be a class of permutation-invariant functions on $\mathcal{X}^{n \times n}$ with $\sigma(\mathcal{C}) = \sigma(Q_K)$. Then $\mathcal{C}$ is GIso-discriminating.

Thus, this sigma-algebra language is a natural notion for characterizing the power of graph neural networks, because as shown above, generating the finest sigma-algebra $\sigma(Q_K)$ is equivalent to being GIso-discriminating, and therefore to universal approximation.

Moreover, when $\mathcal{C}$ is not GIso-discriminating or universal, we can evaluate its representation power by studying $\sigma(\mathcal{C})$, which gives a measure for comparing the power of different GNN families. Given two classes of functions $\mathcal{C}_1, \mathcal{C}_2$, we have $\sigma(\mathcal{C}_1) \subseteq \sigma(\mathcal{C}_2)$ if and only if $\mathcal{M}[\sigma(\mathcal{C}_1)] \subseteq \mathcal{M}[\sigma(\mathcal{C}_2)]$, if and only if $\mathcal{C}_1$ is no more powerful than $\mathcal{C}_2$ in terms of representation power.

In Appendix C we use this notion to compare the expressive power of different families of GNNs as well as other algorithms like 1-WL, linear programming and semidefinite programming in terms of their ability to distinguish non-isomorphic graphs. We summarize our findings in Figure 1.

[Figure 1: diagram relating sGNN$(I, A)$, LP ≡ 1-WL ≡ GIN, SDP, MPNN*, sGNN$(I, D, A, \{\min\{A^t, 1\}\}_{t=1}^T)$, order-2 $G$-invariant networks*, spectral methods, SoS hierarchy and Ring-GNN.]

Figure 1: Relative comparison of function classes in terms of their ability to solve graph isomorphism. *Note that, on one hand, GIN is defined by [30] as a form of message passing neural network, justifying the inclusion GIN $\hookrightarrow$ MPNN. On the other hand, [17] shows that message passing neural networks can be expressed as a modified form of order-2 $G$-invariant networks (which may not coincide with the definition we consider in this paper). Therefore the inclusion GIN $\hookrightarrow$ order-2 $G$-invariant networks has yet to be established rigorously.

We now investigate the $G$-invariant network framework proposed in [18] (see Appendix D for its definition and a description of an adapted version that works on graph-structured inputs, which we call the Graph $G$-invariant Networks). The architecture of $G$-invariant networks is built by interleaving compositions of equivariant linear layers between tensors of potentially different orders and point-wise nonlinear activation functions. It is a powerful framework that can achieve universal approximation if the order of the tensor can grow as $n(n-1)/2$, where $n$ is the number of nodes in the graph, but less is known about its approximation power when the tensor order is restricted. One particularly interesting subclass of $G$-invariant networks is the one with maximum tensor order 2, because [17] shows that it can approximate any Message Passing Neural Network [8]. Moreover, it is both mathematically cumbersome and computationally expensive to include equivariant linear layers involving tensors with order higher than 2.

Our following result shows that the order-2 Graph $G$-invariant Networks subclass of functions is quite restrictive. The proof is given in Appendix D.

Theorem 7.

Order-2 Graph $G$-invariant Networks cannot distinguish between non-isomorphic regular graphs with the same degree.

Motivated by this limitation, we propose a GNN architecture that extends the family of order-2 Graph $G$-invariant Networks without going into higher-order tensors. In particular, we want the new family to include GNNs that can distinguish some pairs of non-isomorphic regular graphs with the same degree. For instance, take the pair of Circular Skip Link graphs illustrated in Figure 5. Roughly speaking, if all the nodes in both graphs have the same node feature, then because they all have the same degree, the updates of node states in both graph neural networks based on neighborhood aggregation and the WL test will fail to distinguish the nodes. However, the power graphs of the two graphs have different degrees. Another important example comes from spectral methods that operate on normalized operators, such as the normalized Laplacian $\Delta = I - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree operator. Such normalization preserves the permutation symmetries and in many clustering applications leads to dramatic improvements [27].

This motivates us to consider a polynomial ring generated by the matrices that are the outputs of permutation-equivariant linear layers, rather than just the linear space of those outputs. Together with point-wise nonlinear activation functions such as ReLU, power graph adjacency matrices like $\min(A^2, 1)$ can be expressed with suitable choices of parameters. We call the resulting architecture the Ring-GNN.

If $A$ is the adjacency matrix of a graph, its power graph has adjacency matrix $\min(A^2, 1)$. The matrix $\min(A^2, 1)$ has been used in [5] in graph neural networks for community detection and in [21] for the quadratic assignment problem.

We call it Ring-GNN since the main object we consider is the ring of matrices, but technically we can express an associative algebra since our model includes scalar multiplications.
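The contrast between neighborhood aggregation and the power graph can be reproduced on a concrete pair of Circular Skip Link graphs. The sketch below is illustrative only: the pair $n = 41$, $k \in \{2, 3\}$ is an assumption chosen for the example, and `wl_color_histogram` is a plain 1-WL color refinement written for this purpose, not the authors' code.

```python
from collections import Counter

def csl_adjacency(n, k):
    """Adjacency matrix of a Circular Skip Link graph: i ~ j iff |i - j| = 1 or k (mod n)."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for step in (1, k):
            j = (i + step) % n
            A[i][j] = A[j][i] = 1
    return A

def wl_color_histogram(A, rounds=3):
    """1-WL color refinement from uniform initial colors; returns the color histogram.

    Colors are kept as nested tuples so that they are comparable across graphs.
    """
    n = len(A)
    colors = [0] * n
    for _ in range(rounds):
        colors = [
            (colors[i], tuple(sorted(colors[j] for j in range(n) if A[i][j])))
            for i in range(n)
        ]
    return Counter(colors)

def power_graph_degrees(A):
    """Row sums of min(A^2, 1): how many nodes are reachable by a walk of length 2."""
    n = len(A)
    degrees = []
    for i in range(n):
        reach = {j2 for j in range(n) if A[i][j] for j2 in range(n) if A[j][j2]}
        degrees.append(len(reach))
    return degrees

n = 41  # illustrative; skip lengths 2 and 3 are also an assumption for this example
graph_a, graph_b = csl_adjacency(n, 2), csl_adjacency(n, 3)

print(set(map(sum, graph_a)), set(map(sum, graph_b)))                 # {4} {4}
print(wl_color_histogram(graph_a) == wl_color_histogram(graph_b))     # True
print(set(power_graph_degrees(graph_a)), set(power_graph_degrees(graph_b)))  # {9} {7}
```

Both graphs are 4-regular, so color refinement never separates any nodes and the two histograms coincide, while the row sums of $\min(A^2, 1)$ already differ (the counts include the node itself, since every node has a closed walk of length 2).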
Figure 5: The Circular Skip Link graphs $G_{n,k}$ are undirected graphs in $n$ nodes $q_0, \dots, q_{n-1}$ so that $(i, j) \in E$ if and only if $|i - j| \equiv 1$ or $k \pmod n$. The figure depicts two such graphs (left and right). It is very easy to check that $G_{n,k}$ and $G_{n',k'}$ are not isomorphic unless $n = n'$ and $k \equiv \pm k' \pmod n$. Both 1-WL and $G$-invariant networks fail to distinguish them.

Definition 5 (Ring-GNN). Given a graph in $n$ nodes with both node and edge features in $\mathbb{R}^d$, we represent it with a matrix $A \in \mathbb{R}^{n \times n \times d}$. [17] shows that all linear equivariant layers from $\mathbb{R}^{n \times n}$ to $\mathbb{R}^{n \times n}$ can be expressed as

$$L_\theta(A) = \sum_{i=1}^{15} \theta_i L_i(A) + \sum_{i=16}^{17} \theta_i L_i,$$

where the $\{L_i\}_{i=1,\dots,15}$ are the 15 basis functions of all linear equivariant functions from $\mathbb{R}^{n \times n}$ to $\mathbb{R}^{n \times n}$, $L_{16}$ and $L_{17}$ are the basis for the bias terms, and $\theta \in \mathbb{R}^{17}$ are the parameters that determine $L_\theta$. Generalizing to an equivariant linear layer from $\mathbb{R}^{n \times n \times d}$ to $\mathbb{R}^{n \times n \times d'}$, we set

$$L_\theta(A)_{\cdot,\cdot,k'} = \sum_{k=1}^{d} \sum_{i=1}^{15} \theta_{k,k',i} L_i(A_{\cdot,\cdot,k}) + \sum_{i=16}^{17} \theta_{k,k',i} L_i,$$

with $\theta \in \mathbb{R}^{d \times d' \times 17}$.

With this formulation, we now define a Ring-GNN with $T$ layers. First, set $A^{(0)} = A$. In the $t$th layer, let

$$B_1^{(t)} = \rho(L_{\alpha^{(t)}}(A^{(t)}))$$
$$B_2^{(t)} = \rho(L_{\beta^{(t)}}(A^{(t)}) \cdot L_{\gamma^{(t)}}(A^{(t)}))$$
$$A^{(t+1)} = k_1^{(t)} B_1^{(t)} + k_2^{(t)} B_2^{(t)}$$

where $\rho$ is a point-wise nonlinear activation function, and $k_1^{(t)}, k_2^{(t)} \in \mathbb{R}$ and $\alpha^{(t)}, \beta^{(t)}, \gamma^{(t)} \in \mathbb{R}^{d^{(t)} \times d'^{(t)} \times 17}$ are learnable parameters. If a scalar output is desired, then in the general form, we set the output to be

$$\theta_S \sum_{i,j} A^{(T)}_{ij} + \theta_D \sum_{i} A^{(T)}_{ii} + \sum_i \theta_i \lambda_i(A^{(T)}),$$

where $\theta_S, \theta_D, \theta_1, \dots, \theta_n \in \mathbb{R}$ are trainable parameters, and $\lambda_i(A^{(T)})$ is the $i$-th eigenvalue of $A^{(T)}$.

Note that each layer is equivariant, and the map from $A$ to the final scalar output is invariant. A Ring-GNN can reduce to an order-2 Graph $G$-invariant Network if $k_2^{(t)} = 0$.
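A single-channel version of the layer update can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: `equivariant_linear` uses only four of the 15 basis operations plus the two bias terms (identity, transpose, diagonal part, broadcast row sums, all-ones bias, identity bias), works with $d = d' = 1$, and takes ReLU for $\rho$. All function names and parameter values are hypothetical.

```python
import itertools

def relu(M):
    """Entry-wise ReLU."""
    return [[max(0, x) for x in row] for row in M]

def matmul(X, Y):
    """Plain n x n matrix product."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def equivariant_linear(A, theta):
    """Equivariant linear map from a reduced basis (an assumption made for brevity)."""
    n = len(A)
    row_sums = [sum(row) for row in A]
    return [[theta[0] * A[i][j]                        # identity
             + theta[1] * A[j][i]                      # transpose
             + theta[2] * (A[i][i] if i == j else 0)   # diagonal part
             + theta[3] * row_sums[i]                  # row sums broadcast along rows
             + theta[4]                                # all-ones bias
             + theta[5] * (1 if i == j else 0)         # identity bias
             for j in range(n)] for i in range(n)]

def ring_gnn_layer(A, alpha, beta, gamma, k1, k2):
    """A^(t+1) = k1 * relu(L_alpha(A)) + k2 * relu(L_beta(A) @ L_gamma(A))."""
    n = len(A)
    B1 = relu(equivariant_linear(A, alpha))
    B2 = relu(matmul(equivariant_linear(A, beta), equivariant_linear(A, gamma)))
    return [[k1 * B1[i][j] + k2 * B2[i][j] for j in range(n)] for i in range(n)]

def relabel(A, perm):
    """Apply a node relabeling to both rows and columns."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

# Integer parameters keep the arithmetic exact for the equality check below.
alpha = [1, 0, 2, -1, 1, 0]
beta = [1, 1, 0, 0, 0, 1]
gamma = [2, 0, 1, 1, -1, 0]
A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]

def layer(M):
    return ring_gnn_layer(M, alpha, beta, gamma, k1=1, k2=1)

# Equivariance: relabeling the input relabels the output in the same way.
equivariant = all(layer(relabel(A, p)) == relabel(layer(A), p)
                  for p in itertools.permutations(range(4)))
print(equivariant)  # True
```

The check passes because each basis operation commutes with relabeling, the matrix product of two equivariant outputs is again equivariant, and ReLU acts entry-wise. Setting `k2=0` drops the product term and leaves only the linear branch.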
With $J + 1$ layers and suitable choices of the parameters, it is possible to obtain $\min(A^{2^J}, 1)$ in the $(J + 1)$th layer. Therefore, we expect it to succeed in distinguishing certain pairs of regular graphs that order-2 Graph $G$-invariant Networks fail on, such as the Circular Skip Link graphs. Indeed, this is verified in the synthetic experiment presented in the next section. The normalized Laplacian can also be obtained, since the degree matrix can be inverted by taking the reciprocal on the diagonal, and then entry-wise inversion and square root on the diagonal can be approximated by MLPs.

The terms in the output layer involving eigenvalues are optional, depending on the task. For example, in community detection spectral information is commonly used [15]. We could also take a fixed number of eigenvalues instead of the full spectrum. In the experiments, Ring-GNN-SVD includes the eigenvalue terms while Ring-GNN does not, as explained in Appendix E. Computationally, the complexity of running the forward model grows as $O(n^3)$, dominated by matrix multiplications and possibly singular value decomposition for computing the eigenvalues. We note also that Ring-GNN can be augmented with matrix inverses or more generally with functional calculus on the spectrum of any of the intermediate representations while keeping $O(n^3)$ computational complexity. Finally, note that a Graph $G$-invariant Network with maximal tensor order $d$ will have complexity at least $O(n^d)$. Therefore, the Ring-GNN explores higher-order interactions in the graph that order-2 Graph $G$-invariant Networks neglect, while remaining computationally tractable.

The different models and the detailed setup of the experiments are discussed in Appendix E.

When $A = A^{(0)}$ is an undirected graph, one easily verifies that $A^{(t)}$ contains only symmetric matrices for each $t$.
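The statement about $\min(A^{2^J}, 1)$ rests on two elementary facts: each matrix product doubles the walk length, and for non-negative entries the clamp can be written with ReLU as $\min(x, 1) = 1 - \mathrm{ReLU}(1 - x)$. The sketch below checks both on a small path graph. It is an illustration of the arithmetic only, with hypothetical helper names, and makes no claim about the specific parameter choices the construction in the paper uses.

```python
def matmul(X, Y):
    """Plain n x n matrix product."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def relu(x):
    return max(0, x)

def clamp_to_one(M):
    """Entry-wise min(M, 1) for non-negative M, written as 1 - relu(1 - x)."""
    return [[1 - relu(1 - x) for x in row] for row in M]

def clamped_power(A, J):
    """min(A^(2^J), 1) via J squarings, clamping after each one."""
    M = A
    for _ in range(J):
        M = clamp_to_one(matmul(M, M))
    return M

def direct_clamped_power(A, J):
    """Reference computation: form A^(2^J) exactly, then clamp once at the end."""
    M = A
    for _ in range(J):
        M = matmul(M, M)
    return [[min(x, 1) for x in row] for row in M]

# Path graph on 5 nodes (an arbitrary small example).
path = [[0, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0]]

print(clamped_power(path, 2) == direct_clamped_power(path, 2))  # True
```

Clamping after every squaring gives the same result as clamping once at the end because, for non-negative matrices, the zero pattern of a product depends only on the zero patterns of its factors. The intermediate entries therefore stay in $\{0, 1\}$ instead of growing with the number of walks.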
6.1 Classifying Circular Skip Links (CSL) graphs

The following experiment on synthetic data demonstrates the connection between function fitting and graph isomorphism testing. The Circular Skip Links graphs are undirected regular graphs with node degree 4 [20], as illustrated in Figure 5. Note that two CSL graphs $G_{n,k}$ and $G_{n',k'}$ are not isomorphic unless $n = n'$ and $k \equiv \pm k' \pmod n$. In the experiment, which has the same setup as in [20], we fix $n = 41$ and let $k$ range over 10 values, each of which corresponds to a distinct isomorphism class. The task is then to classify a graph $G_{n,k}$ by its skip length $k$.

Note that since the 10 classes have the same size, a naive uniform classifier would obtain 0.1 accuracy. As we see from Table 1, neither GIN nor the $G$-invariant network with tensor order 2 outperforms the naive classifier. Their failure in this task is unsurprising: WL tests are proved to fall short of distinguishing such pairs of non-isomorphic regular graphs [4], and hence so does GIN [30]; by the theoretical results from the previous section, order-2 Graph $G$-invariant networks are unable to distinguish them either. Therefore, their failure as graph isomorphism tests is consistent with their failure in this classification task, which can be understood as trying to approximate the function that maps the graphs to their class labels.

It should be noted that, since graph isomorphism tests are not entirely well-posed as classification tasks, the performance of GNN models could vary due to randomness. But the fact that Ring-GNNs achieve a relatively high maximum accuracy (compared to RP for example) demonstrates that as a class of GNNs it is rich enough to contain functions that distinguish the CSL graphs to a large extent.

GNN architecture | CSL max | CSL min | CSL std | IMDBB mean | IMDBB std | IMDBM mean | IMDBM std
RP-GIN † | … | … | … | … | … | … | …
GIN † | 10 | 10 | 0 | 75.1 | 5.1 | 52.3 | 2.8
Order 2 G-invariant † | 10 | 10 | 0 | 71.27 | 4.5 | 48.55 | 3.9
sGNN-5 | 80 | 80 | 0 | 72.8 | 3.8 | 49.4 | 3.2
sGNN-2 | 30 | 30 | 0 | 73.1 | 5.2 | 49.0 | 2.1
sGNN-1 | 10 | 10 | 0 | 72.7 | 4.9 | 49.0 | 2.1
LGNN [5] | 30 | 30 | 0 | 74.1 | 4.6 | 50.9 | 3.0
Ring-GNN | 80 | 10 | 15.7 | 73.0 | 5.4 | 48.2 | 2.7
Ring-GNN-SVD | 100 | 100 | 0 | 73.1 | 3.3 | 49.6 | 3.0

Table 1: (left) Accuracy of different GNNs at classifying CSL (see Section 6.1). We report the best performance and worst performance among 10 experiments. (right) Accuracy of different GNNs at classifying real datasets (see Section 6.1). We report the best performance among all epochs on a 10-fold cross validation dataset, as was done in [30]. †: Reported performance by [20], [30] and [17].

We use the two IMDB datasets (IMDBBINARY, IMDBMULTI) to test different models in real-world scenarios. Since our focus is on distinguishing graph structures, these datasets are suitable as they do not contain node features, and hence the adjacency matrix contains all the input data. The IMDBBINARY dataset has 1000 graphs, with 19.8 nodes on average and 2 classes. The dataset is randomly partitioned into 900/100 for training/validation. The IMDBMULTI dataset has 1500 graphs, with 13.0 nodes on average and 3 classes. The dataset is randomly partitioned into 1350/150 for training/validation. All models are evaluated via 10-fold cross validation, and the best accuracy is calculated by averaging across folds and then maximizing along epochs [30]. Importantly, the architecture hyper-parameters of the Ring-GNN we use are close to those provided in [17], to show that the order-2 $G$-invariant Network is included in the model family we propose. The results show that Ring-GNN models achieve higher performance than Order-2 G-invariant networks in both datasets. Admittedly its accuracy does not reach that of the state of the art. However, the main goal of this part of our work is not necessarily to invent the best-performing GNN through hyperparameter optimization, but rather to propose Ring-GNN as an augmented version of order-2 Graph $G$-invariant Networks and show experimental results that support the theory.

Conclusions

In this work we address the important question of organizing the fast-growing zoo of GNN architectures in terms of what functions they can and cannot represent. We follow the approach via the graph isomorphism test, and show that it is equivalent to the other perspective via function approximation. We leverage our graph isomorphism reduction to augment order-2 G-invariant nets with the ring of operators associated with matrix multiplication, which gives provable gains in expressive power with complexity $O(n^3)$, and is amenable to efficiency gains by leveraging sparsity in the graphs.

Our general framework leaves many interesting questions unresolved. First, a more comprehensive analysis is needed of which elements of the algebra are really required, depending on the application. Next, our current GNN taxonomy is still incomplete, and in particular we believe it is important to further discern the abilities of spectral and neighborhood-aggregation-based architectures. Finally, and most importantly, our current notion of invariance (based on permutation symmetry) defines a topology in the space of graphs that is too strong; in other words, two graphs are either considered equal (if they are isomorphic) or not. Extending the theory of symmetric universal approximation to take into account a weaker metric in the space of graphs, such as the Gromov-Hausdorff distance, is a natural next step that will better reflect the stability requirements of powerful graph representations to small graph perturbations in real-world applications.

Acknowledgements

We would like to thank Haggai Maron for fruitful discussions and for pointing us towards G-invariant networks as powerful models to study representational power in graphs. This work was partially supported by NSF grant RI-IIS 1816753, NSF CAREER CIF 1845360, the Alfred P. Sloan Fellowship, Samsung GRP and Samsung Electronics. SV was partially funded by EOARD FA9550-18-1-7007 and the Simons Collaboration Algorithms and Geometry.

References

[1] László Babai. Graph isomorphism in quasipolynomial time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 684–697. ACM, 2016.
[2] Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082, 2019.
[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
[5] Zhengdao Chen, Lisha Li, and Joan Bruna. Supervised community detection with line graph neural networks. International Conference on Learning Representations, 2019.
[6] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
[7] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
[8] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR.org, 2017.
[9] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[10] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
[11] Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems, pages 7211–7221, 2018.
[12] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251–257, 1991.
[13] Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943, 2019.
[14] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[15] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
[16] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. arXiv preprint arXiv:1905.11136, 2019.
[17] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. 2018.
[18] Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.
[19] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. Association for the Advancement of Artificial Intelligence, 2019.
[20] Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling for graph representations. arXiv preprint arXiv:1903.02541, 2019.
[21] Alex Nowak, Soledad Villar, Afonso S Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. arXiv preprint arXiv:1706.07450, 2017.
[22] Ryan O'Donnell, John Wright, Chenggang Wu, and Yuan Zhou. Hardness of robust graph isomorphism, lasserre gaps, and asymmetry of random graphs. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1659–1677. Society for Industrial and Applied Mathematics, 2014.
[23] Motakuri V Ramana, Edward R Scheinerman, and Daniel Ullman. Fractional isomorphism of graphs. Discrete Mathematics, 132(1-3):247–265, 1994.
[24] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. Proceedings of the 34th International Conference on Machine Learning, 2017.
[25] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[26] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[27] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
[28] B Weisfeiler and A Leman. The reduction of a graph to canonical form and the algebra which appears therein. Nauchno-Technicheskaya Informatsia, 2(9):12-16, 1968.
[29] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
[30] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[31] Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.
[32] Qing Zhao, Stefan E Karisch, Franz Rendl, and Henry Wolkowicz. Semidefinite programming relaxations for the quadratic assignment problem. Journal of Combinatorial Optimization, 2(1):71–109, 1998.

A Proofs on universal approximation and graph isomorphism

Lemma 1. If $\mathcal{C}$ is GIso-discriminating, then for all $G \in \mathcal{X}^{n \times n}$, there exists a function $\tilde{h}_G \in \mathcal{C}^{+1}$ such that for all $G'$, $\tilde{h}_G(G') = 0$ if and only if $G \simeq G'$.

Proof of Lemma 1. Given $G, G' \in \mathcal{X}^{n \times n}$ with $G \not\simeq G'$, let $h_{G,G'} \in \mathcal{C}$ be the function that distinguishes this pair, i.e. $h_{G,G'}(G) \neq h_{G,G'}(G')$. Then define a function $\bar{h}_{G,G'}$ by

$$\bar{h}_{G,G'}(G^*) = |h_{G,G'}(G^*) - h_{G,G'}(G)| = \max(h_{G,G'}(G^*) - h_{G,G'}(G), 0) + \max(h_{G,G'}(G) - h_{G,G'}(G^*), 0). \tag{1}$$

Note that if $G^* \simeq G$, then $h_{G,G'}(G^*) = h_{G,G'}(G)$, and so $\bar{h}_{G,G'}(G^*) = 0$. If $G^* \simeq G'$, then $\bar{h}_{G,G'}(G^*) > 0$. Otherwise, $\bar{h}_{G,G'}(G^*) \geq 0$.

Next, define a function $\tilde{h}_G$ by $\tilde{h}_G(G^*) = \sum_{G' \in \mathcal{X}^{n \times n},\, G' \not\simeq G} \bar{h}_{G,G'}(G^*)$. If $G^* \simeq G$, we have $\tilde{h}_G(G^*) = 0$, whereas if $G^* \not\simeq G$ then $\tilde{h}_G(G^*) > 0$.

Thus, it suffices to show that $\tilde{h}_G \in \mathcal{C}^{+1}$. We take the finite subcollection of functions $\{h_{G,G'}\}_{G' \in \mathcal{X}^{n \times n},\, G \not\simeq G'}$ and feed the input graph $G^*$ to each of them to obtain a vector of outputs. By equation (1), $\bar{h}_{G,G'}(G^*)$ can be obtained from $h_{G,G'}(G^*)$ by passing through one ReLU layer. Finally, a finite summation across $G' \not\simeq G$ yields $\tilde{h}_G(G^*)$. Therefore, $\tilde{h}_G \in \mathcal{C}^{+1}$, $\forall G \in \mathcal{X}^{n \times n}$.

Lemma 2. Let $\mathcal{C}$ be a class of permutation-invariant functions from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$ so that for all $G \in \mathcal{X}^{n \times n}$, there exists $\tilde{h}_G \in \mathcal{C}$ satisfying $\tilde{h}_G(G') = 0$ if and only if $G \simeq G'$. Then $\mathcal{C}^{+1}$ is universally approximating.

Proof of Lemma 2.

In fact, in the finite feature setting we can obtain a stronger result: every permutation-invariant $f$ satisfies $f \in \mathcal{C}^{+1}$, and so no approximation is needed.

We first use the $\tilde{h}_G$'s to construct all the indicator functions $\mathbb{1}_{G \simeq G^*}$ as functions of $G^*$ on $\mathcal{X}^{n \times n}$. To achieve this, because $\mathcal{X}^{n \times n}$ is finite, $\forall G$, we let $\delta_G = \min_{G' \in \mathcal{X}^{n \times n},\, G' \not\simeq G} |\tilde{h}_G(G')| > 0$. We then introduce a "bump" function from $\mathbb{R}$ to $\mathbb{R}$ with parameters $a$ and $b$, $\psi_{a,b}(x) = \psi((x - b)/a)$, where $\psi(x) = \max(x - 1, 0) + \max(x + 1, 0) - 2\max(x, 0)$. Then $\psi_{a,b}(b) = 1$, and $\mathrm{supp}(\psi_{a,b}) = (b - a, b + a)$. Now, we define a function $\varphi_G$ from $\mathcal{X}^{n \times n}$ to $\mathbb{R}$ by $\varphi_G(G^*) = \psi_{\delta_G, 0}(\tilde{h}_G(G^*))$. Note that $\varphi_G(G^*) = \mathbb{1}_{G \simeq G^*}$ as a function of $G^*$ on $\mathcal{X}^{n \times n}$.

Given $f$, thanks to the finiteness of the input space $\mathcal{X}^{n \times n}$, we decompose it as

$$f(G^*) = \Big(\frac{1}{|S_n|} \sum_{G \in \mathcal{X}^{n \times n}} \mathbb{1}_{G \simeq G^*}\Big) f(G^*) = \frac{1}{|S_n|} \sum_{G \in \mathcal{X}^{n \times n}} f(G)\, \mathbb{1}_{G \simeq G^*} = \frac{1}{|S_n|} \sum_{G \in \mathcal{X}^{n \times n}} f(G)\, \varphi_G(G^*).$$

The right hand side can be realized in $\mathcal{C}^{+1}$, since we can first take the finite collection of functions $\{\tilde{h}_G\}_{G \in \mathcal{X}^{n \times n}}$ and obtain $\{\tilde{h}_G(G^*)\}_{G \in \mathcal{X}^{n \times n}}$. Then, with an MLP with one hidden layer, we can obtain $\{\varphi_G(G^*)\}_{G \in \mathcal{X}^{n \times n}}$, a linear combination of which gives the right hand side, since each $f(G)$ within the summation is a constant.

Theorem 3. If $\mathcal{C}$ is universally approximating, then it is also GIso-discriminating.

Proof of Theorem 3. $\forall G_1, G_2 \in K$, if $G_1 \not\simeq G_2$, define $f_1(G) = \min_{\pi \in S_n} d(G_1, \pi^\top G \pi)$. It is a continuous and permutation-invariant function on $K$, and therefore can be approximated by a function $h \in \mathcal{C}$ to within $\epsilon = \frac{1}{2} f_1(G_2) > 0$ accuracy. Then $h$ is a function that can discriminate between $G_1$ and $G_2$.

Lemma 3. If $\mathcal{C}$, a collection of continuous permutation-invariant functions from $K$ to $\mathbb{R}$, is pairwise distinguishing, then $\mathcal{C}^{+1}$ is able to locate every isomorphism class.

Proof of Lemma 3.

Fix any $G \in K$. $\forall G' \not\simeq G \in K$, $\exists h_{G,G'} \in \mathcal{C}$ such that $h_{G,G'}(G) \neq h_{G,G'}(G')$. For each $G'$, define a set $A_{G'}$ as

$$A_{G'} = h_{G,G'}^{-1}\Big(\big(h_{G,G'}(G') - \tfrac{1}{2}|h_{G,G'}(G') - h_{G,G'}(G)|,\; h_{G,G'}(G') + \tfrac{1}{2}|h_{G,G'}(G') - h_{G,G'}(G)|\big)\Big) \subseteq K.$$

Obviously $G' \in A_{G'}$ and $G \notin A_{G'}$. Since $h_{G,G'}$ is assumed continuous, $A_{G'}$ is an open set for each $G' \not\simeq G$. If $G' \simeq G$, define $A_{G'} = B(G', \epsilon)$, the open $\epsilon$-ball in $K$ under the Euclidean distance.

Thus, $\{A_{G'}\}_{G' \in K}$ is an open cover of $K$. Since $K$ is compact, $\exists$ a finite subset $K_0$ of $K$ such that $\{A_{G'}\}_{G' \in K_0}$ also covers $K$. Hence, $\forall G^* \in K$, $\exists G' \in K_0$ such that $G^* \in A_{G'}$. Moreover, $\forall G^* \in K \setminus (\bigcup_{G' \in \mathcal{E}(G)} A_{G'}) = K \setminus (\bigcup_{\pi \in S_n} B(\pi^\top G \pi, \epsilon))$, where $\mathcal{E}(G)$ represents the equivalence class of graphs in $K$ consisting of graphs isomorphic to $G$, $\exists G' \in K_0 \setminus \mathcal{E}(G)$ such that $G^* \in A_{G'}$.

Now define a function $\tilde{h}_G$ on $K$ by $\tilde{h}_G(G^*) = \sum_{G' \in K_0 \setminus \mathcal{E}(G)} \bar{h}_{G,G'}(G^*)$, where $\bar{h}_{G,G'}(G^*) = \max(|h_{G,G'}(G) - h_{G,G'}(G')| - |h_{G,G'}(G^*) - h_{G,G'}(G')|, 0)$. Since each $h_{G,G'}$ is continuous, $\tilde{h}_G$ is also continuous. Thus, we can show that $\tilde{h}_G$ is the desired function in Definition 4:

• $\bar{h}_{G,G'}$ is nonnegative $\forall G, G'$, and hence $\tilde{h}_G$ is nonnegative on $K$.
• If $G^* \simeq G$, then as each $h_{G,G'}$ is permutation invariant, $h_{G,G'}(G^*) = h_{G,G'}(G)$, and hence $\bar{h}_{G,G'}(G^*) = 0$. Thus, $\tilde{h}_G(G^*) = 0$.
• If $\forall \pi \in S_n$, $d(\pi^\top G^* \pi, G) \geq \epsilon$, then $G^* \in K \setminus \bigcup_{G' \in \mathcal{E}(G)} A_{G'}$. Therefore, $\exists G' \in K_0 \setminus \mathcal{E}(G)$ such that $G^* \in A_{G'}$, which implies that $|h_{G,G'}(G^*) - h_{G,G'}(G')| < \frac{1}{2}|h_{G,G'}(G) - h_{G,G'}(G')| < |h_{G,G'}(G) - h_{G,G'}(G')|$. Therefore, $|h_{G,G'}(G) - h_{G,G'}(G')| - |h_{G,G'}(G^*) - h_{G,G'}(G')| > \frac{1}{2}|h_{G,G'}(G) - h_{G,G'}(G')| > 0$, and so $\tilde{h}_G(G^*) \geq \bar{h}_{G,G'}(G^*) > \frac{1}{2}|h_{G,G'}(G) - h_{G,G'}(G')|$.

Define $\delta_G = \frac{1}{2} \min_{G' \in K_0 \setminus \mathcal{E}(G)} |h_{G,G'}(G) - h_{G,G'}(G')| > 0$. Then if $\tilde{h}_G(G^*) < \delta_G$, it has to be the case that $G^* \in \bigcup_{G' \in \mathcal{E}(G)} A_{G'} = \bigcup_{\pi \in S_n} B(\pi^\top G \pi, \epsilon)$, implying that $\exists \pi \in S_n$ such that $d(G^*, \pi^\top G \pi) < \epsilon$.

Finally, it is clear that $\tilde{h}_G$ can be realized in $\mathcal{C}^{+1}$.

Figure 3: Illustrating the definition of GIso-discriminating. $G$, $G'$ and $G''$ are mutually non-isomorphic, and each of the big circles with dashed boundary represents an equivalence class under graph isomorphism. $h_{G,G'}$ is a permutation-invariant function that takes different values on the equivalence class of $G$ and on that of $G'$, and similarly for $h_{G,G''}$. If the graph space has only these three equivalence classes of graphs, then $\mathcal{C} = \{h_{G,G'}, h_{G,G''}\}$ is GIso-discriminating.

Lemma 4. Let $\mathcal{C}$ be a class of permutation-invariant functions $K \to \mathbb{R}$. If $\mathcal{C}$ is able to locate every isomorphism class, then $\mathcal{C}^{+2}$ is universally approximating.

Proof of Lemma 4.

Consider any $f$ that is continuous and permutation-invariant. Since $K$ is compact, $f$ is uniformly continuous on $K$. Therefore, $\forall \epsilon > 0$, $\exists r > 0$ such that $\forall G_1, G_2 \in K$, if $d(G_1, G_2) < r$, then $|f(G_1) - f(G_2)| < \epsilon$. [...] Therefore, $\psi_G$ is well defined on $K$, and $\mathrm{supp}(\psi_G) = \mathrm{supp}(\varphi_G) = h_G^{-1}(\delta_G)$. Moreover, $\forall G' \in K$, $\sum_{G \in K_0} \psi_G(G') = 1$. Therefore, the set of functions $\{\psi_G\}_{G \in K_0}$ is a "partition of unity" with respect to the open cover $\{h_G^{-1}(\delta_G)\}_{G \in K_0}$.

Back to the function $f$ that we want to approximate. We want to express it in a way that resembles what a neural network can do. With the set of functions $\{\psi_G\}_{G \in K_0}$, we have

$$f(G') = \sum_{G \in K_0} f(G')\, \psi_G(G') = \sum_{G \in K_0,\; G' \in h_G^{-1}(\delta_G)} f(G')\, \psi_G(G').$$

If $G' \in h_G^{-1}(\delta_G)$, then $d(G', G) < r$, and therefore $|f(G') - f(G)| < \epsilon$. Hence, we can use $\bar{h}(G') = \sum_{G \in K_0} f(G)\, \psi_G(G')$ to approximate $f(G')$, because

$$\Big|f(G') - \sum_{G \in K_0} f(G)\, \psi_G(G')\Big| = \Big|f(G') - \sum_{G \in K_0,\; G' \in h_G^{-1}(\delta_G)} f(G)\, \psi_G(G')\Big| \leq \sum_{G \in K_0,\; G' \in h_G^{-1}(\delta_G)} |f(G') - f(G)|\, \psi_G(G') < \epsilon. \tag{2}$$

Finally, we need to show how to approximate $\bar{h}$ with functions from $\mathcal{C}$ augmented with a multi-layer perceptron. We start with $\{h_G\}_{G \in K_0} \subseteq \mathcal{C}$, and apply them to the input graph $G'$. Then, to each $h_G(G')$ we apply an MLP with one hidden layer to obtain $\varphi_G(G')$, and use one node to store their sum, $\varphi(G')$. We then use an MLP with one hidden layer to approximate division, obtaining $\psi_G(G')$. Finally, $\bar{h}(G')$ is approximated by a linear combination of $\{\psi_G(G')\}_{G \in K_0}$, since each $f(G)$ is a constant.

B Proofs of Section 4.2

Theorem 5. If $\mathcal{C}$ is a class of permutation-invariant functions on $\mathcal{X}^{n \times n}$ and $\mathcal{C}$ is GIso-discriminating, then $\sigma(\mathcal{C}) = \sigma(Q_K)$.

Proof of Theorem 5. If $\mathcal{C}$ is GIso-discriminating, then given a $G \in \mathcal{X}^{n \times n}$, $\forall G' \not\simeq G$, $\exists h_{G'} \in \mathcal{C}$ and $b_{G'} \in \mathbb{R}$ such that $\mathcal{E}(G) = \bigcap_{G' \not\simeq G} h_{G'}^{-1}(\{b_{G'}\})$, which is a finite intersection of sets in $\sigma(\mathcal{C})$. Hence, $\mathcal{E}(G) \in \sigma(f_G) \subseteq \sigma(\mathcal{C})$. Therefore, $Q_K \subseteq \sigma(\mathcal{C})$, and hence $\sigma(Q_K) \subseteq \sigma(\mathcal{C})$. Moreover, since $\sigma(g) \subseteq \sigma(Q_K)$ for all $g \in \mathcal{C}$, we have $\sigma(\mathcal{C}) \subseteq \sigma(Q_K)$.

Theorem 6. Let $\mathcal{C}$ be a class of permutation-invariant functions on $\mathcal{X}^{n \times n}$ with $\sigma(\mathcal{C}) = \sigma(Q_K)$. Then $\mathcal{C}$ is GIso-discriminating.

Proof of Theorem 6. Suppose not. This implies that $Q_K \not\subseteq \sigma(\mathcal{C})$, and hence $\exists \tau = \mathcal{E}(G) \in Q_K$ such that $\tau \notin \sigma(\mathcal{C})$. Note that $\tau$ is an equivalence class of graphs that are isomorphic to each other. Then consider the smallest subset in $\sigma(\mathcal{C})$ that contains $\tau$, defined as

$$S(\tau) = \bigcap_{T \in \sigma(\mathcal{C}),\; \tau \subseteq T} T.$$

Since $K$ is a finite space, $\sigma(\mathcal{C})$ is also finite, and hence this is a finite intersection. Since a sigma-algebra is closed under finite intersection, $S(\tau) \in \sigma(\mathcal{C})$. As $\tau \notin \sigma(\mathcal{C})$, we know that $\tau \subsetneq S(\tau)$. Then, $\exists G' \not\simeq G$ such that $G' \in S(\tau)$. Then there does not exist any function $h$ in $\mathcal{C}$ such that $h(G) \neq h(G')$, since otherwise the pre-image of some interval in $\mathbb{R}$ under $h$ would intersect $\mathcal{E}(G)$ but not $\mathcal{E}(G')$. Contradiction.

C Comparison of expressive power of families of functions via graph isomorphism

Given two classes of functions $\mathcal{C}_1$, $\mathcal{C}_2$, such as two classes of GNNs, there are four possibilities regarding their relative representation power, using the language of sigma-algebra developed in the main text:

• $\sigma(\mathcal{C}_1) = \sigma(\mathcal{C}_2)$
• $\sigma(\mathcal{C}_1) \subsetneq \sigma(\mathcal{C}_2)$
• $\sigma(\mathcal{C}_2) \subsetneq \sigma(\mathcal{C}_1)$
• Not comparable / None of the above (i.e., $\sigma(\mathcal{C}_1) \not\subseteq \sigma(\mathcal{C}_2)$ and $\sigma(\mathcal{C}_2) \not\subseteq \sigma(\mathcal{C}_1)$)

In this section we summarize some results from the literature and show partial relationships between different GNN architectures in terms of their ability to distinguish non-isomorphic graphs (in the context of the sigma algebra introduced in Section 4). For simplicity, in this section we assume that graphs are given by an adjacency matrix (neither node nor edge features are considered). We illustrate our findings in Figure 1.

• sGNN($\mathcal{M}$). We consider spectral GNNs as the ones used in [5] for community detection. In this context we focus on the simplified version where the GNNs are defined as

$$v^0 = \mathbb{1}_n, \qquad v^{t+1} = \rho\Big(\sum_{M \in \mathcal{M}} M v^t \theta^t_M\Big), \qquad \text{output: } \sum_{i=1}^{d_L} v^L_i,$$

where $\theta^t_M \in \mathbb{R}^{d_t \times d_{t+1}}$ are learnable parameters and $v^t \in \mathbb{R}^{n \times d_t}$. Usually $\mathcal{M}$ is a set of operators related to the graph. In this context we consider $\mathcal{M} = \{I, A\}$ and $\mathcal{M}(J) = \{I, D, A, \min\{A^t, 1\}, t = 2, \ldots\}$. The operators $\min\{A^t, 1\}$ allow the model to distinguish regular graphs that order-2 G-invariant networks cannot distinguish, such as the Circular Skip Link graphs.

• Linear Programming (LP). This is not a GNN but the natural linear programming relaxation for graph isomorphism. Namely, given a pair of graphs with adjacency matrices $A, B \in \{0, 1\}^{n \times n}$,

$$\mathrm{LP}(A, B) = \min \|PA - BP\|_1 \quad \text{subject to } P\mathbb{1}_n = \mathbb{1}_n,\; P^\top \mathbb{1}_n = \mathbb{1}_n,\; P \geq 0.$$

The natural sigma algebra to consider here is $\sigma(\bigcup_{A \in \mathcal{X}^{n \times n}} \{\mathrm{LP}(A, \cdot)\})$. Two graphs are said to be fractionally isomorphic if $\mathrm{LP}(A, B) = 0$ (i.e. the LP cannot distinguish them). [23] showed that two graphs are fractionally isomorphic if and only if they cannot be distinguished by 1-WL.

• Semidefinite Programming (SDP). The semidefinite programming relaxation of quadratic assignment from [32] is based on the following observation: $\|PA - BP\|_F^2 = \|PA\|_F^2 + \|BP\|_F^2 - 2\,\mathrm{trace}(PAP^\top B^\top)$, and the trace term can be written as $\mathrm{trace}(\mathrm{vec}(P)\,\mathrm{vec}(P)^\top A \otimes B^\top)$, where $\otimes$ is the Kronecker product operator and $\mathrm{vec}$ takes an $n \times n$ matrix and flattens it into an $n^2 \times 1$ vector. The resulting semidefinite relaxation considers the vector $x^\top := [1, \mathrm{vec}(P)^\top]$ and relaxes the rank 1 matrix $xx^\top$ into a positive semidefinite matrix. By including the constraints corresponding to the LP in $xx^\top$, one makes sure that the solution of the SDP is always in the feasible set of the LP; therefore the LP is less expressive than the SDP.

• Sum-of-Squares (SoS) hierarchy. One can consider the hierarchy of relaxations coming from sum-of-squares (SoS). In the context of graph isomorphism, it is known that graph isomorphism is a hard problem for this hierarchy [22]. In particular, the Lasserre/SoS hierarchy requires $\Omega(n)$ to solve graph isomorphism (in the same sense that $o(n)$-WL fails to solve graph isomorphism [4]).

• Spectral methods. If we consider the function that takes a graph and outputs the set of eigenvalues of its adjacency matrix, such a function is permutation invariant. A priori one may think that such a function, being highly non-linear, is more expressive than any form of message passing GNN. In fact, regular graphs are not distinguished by 1-WL or order-2 G-invariant networks and may be distinguished by their eigenvalues (like the Circular Skip Link graphs). However, 1-WL and this particular spectral method are not comparable (a simple example is provided in Figure 2 of [23]).

D Graph G-invariant Networks with maximum tensor order 2
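As a concrete instance of the regular-graph examples discussed in Section C and studied in this section, the following Python sketch compares a hexagon with two disjoint triangles, two non-isomorphic 2-regular graphs on six nodes. This pair, the helper names (`cycle_adjacency`, `wl_histograms`, `uniform_fractional_isomorphism`) and the use of NumPy are assumptions made for illustration and do not come from the paper. The sketch checks three things: 1-WL colour refinement assigns both graphs the same colour histogram, the uniform matrix $P = \frac{1}{n}\mathbb{1}\mathbb{1}^\top$ is a feasible point of the LP with $PA = BP$ (so the LP value is 0), and the adjacency spectra differ.

```python
from collections import Counter

import numpy as np


def cycle_adjacency(n):
    """Adjacency matrix of the cycle graph on n nodes (2-regular for n >= 3)."""
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A


def wl_histograms(A, B):
    """Run 1-WL colour refinement on the disjoint union of two equal-size graphs.

    Refining the union keeps colour labels comparable across the two graphs.
    Returns the colour histogram of each graph.
    """
    n = A.shape[0]
    zeros = np.zeros((n, n), dtype=int)
    U = np.block([[A, zeros], [zeros, B]])
    m = 2 * n
    colours = [0] * m  # unlabelled graphs: every node starts with the same colour
    for _ in range(m):  # m rounds are enough for the colouring to stabilise
        signatures = [
            (colours[u], tuple(sorted(colours[v] for v in range(m) if U[u, v])))
            for u in range(m)
        ]
        palette = {s: i for i, s in enumerate(sorted(set(signatures)))}
        colours = [palette[s] for s in signatures]
    return Counter(colours[:n]), Counter(colours[n:])


def uniform_fractional_isomorphism(A, B):
    """Check that P = (1/n) * ones is doubly stochastic and satisfies PA = BP."""
    n = A.shape[0]
    P = np.full((n, n), 1.0 / n)
    feasible = np.allclose(P.sum(axis=0), 1.0) and np.allclose(P.sum(axis=1), 1.0)
    return bool(feasible and np.allclose(P @ A, B @ P))


hexagon = cycle_adjacency(6)
triangles = np.kron(np.eye(2, dtype=int), cycle_adjacency(3))

if __name__ == "__main__":
    hist_hex, hist_tri = wl_histograms(hexagon, triangles)
    print("same 1-WL histogram:", hist_hex == hist_tri)
    print("LP value 0 via uniform P:", uniform_fractional_isomorphism(hexagon, triangles))
    print("hexagon spectrum:  ", np.round(np.linalg.eigvalsh(hexagon.astype(float)), 3))
    print("triangles spectrum:", np.round(np.linalg.eigvalsh(triangles.astype(float)), 3))
```

The uniform matrix works for any pair of regular graphs with the same size and degree $d$, since both $PA$ and $BP$ then equal $\frac{d}{n}\mathbb{1}\mathbb{1}^\top$; this is the simplest way to see why such pairs are fractionally isomorphic.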

In this section we prove Theorem 7, which says that graph G-invariant networks with tensor order 2 cannot distinguish between non-isomorphic regular graphs with the same degree.

First, we need to state our definition of the order-2 Graph G-invariant Networks. In general, given $G \in \mathbb{R}^{n \times n}$, we let $A^{(0)} = G$, $d^{(0)} = 1$, and $A^{(t+1)} = \sigma(L^{(t)}(A^{(t)}))$, and the network outputs $m \circ h \circ A^{(L)}$, where each $L^{(t)}$ is an equivariant linear layer from $\mathbb{R}^{n \times n \times d^{(t)}}$ to $\mathbb{R}^{n \times n \times d^{(t+1)}}$, $\sigma$ is a point-wise activation function, $h$ is an invariant linear layer from $\mathbb{R}^{n \times n}$ to $\mathbb{R}$, and $m$ is an MLP. $d^{(t)}$ is the feature dimension in layer $t$, interpreted as the dimension of the hidden state attached to each pair of nodes. For simplicity of notation, in the following proof we assume that $d^{(t)} = 1$, $\forall t = 1, \ldots, L$, and thus each $A^{(t)}$ is essentially a matrix. The following results can be extended to the cases where $d^{(t)} > 1$ by adding more subscripts in the proof.

Given an unweighted graph $G$, let $E \subseteq [n]^2$ be the edge set of $G$, i.e., $(u, v) \in E$ if $u \neq v$ and $G_{uv} = 1$; set $S \subseteq [n]^2$ to be $\{(u, u)\}_{u \in [n]}$; and let $N = [n]^2 \setminus (E \cup S)$. Thus, $E \cup N \cup S = [n]^2$.

Lemma 5. Let $G, G'$ be the adjacency matrices of two unweighted regular graphs with the same degree $d$, and let $A^{(t)}, E, N, S$ and $A'^{(t)}, E', N', S'$ be defined as above for $G$ and $G'$, respectively. Then $\forall t \leq L$, $\exists \xi_1^{(t)}, \xi_2^{(t)}, \xi_3^{(t)} \in \mathbb{R}$ such that

$$A^{(t)}_{uv} = \xi_1^{(t)} \mathbb{1}_{(u,v) \in E} + \xi_2^{(t)} \mathbb{1}_{(u,v) \in N} + \xi_3^{(t)} \mathbb{1}_{(u,v) \in S}, \qquad A'^{(t)}_{uv} = \xi_1^{(t)} \mathbb{1}_{(u,v) \in E'} + \xi_2^{(t)} \mathbb{1}_{(u,v) \in N'} + \xi_3^{(t)} \mathbb{1}_{(u,v) \in S'}.$$

Proof.

We prove this lemma by induction. For $t = 0$, $A^{(0)} = G$ and $A'^{(0)} = G'$. Since the graph is unweighted, $G_{uv} = 1$ if $u \neq v$ and $(u, v) \in E$, and $0$ otherwise. The same is true for $G'$. Therefore, we can set $\xi_1^{(0)} = 1$ and $\xi_2^{(0)} = \xi_3^{(0)} = 0$.

Next, we consider the inductive step. Assume that the conditions in the lemma are satisfied for layer $t - 1$. To simplify the notation, we use $A, A'$ to stand for $A^{(t-1)}, A'^{(t-1)}$, and we assume that they satisfy the inductive hypothesis with $\xi_1$, $\xi_2$ and $\xi_3$. We thus want to show that if $L$ is any equivariant linear layer, then $\sigma(L(A)), \sigma(L(A'))$ also satisfy the inductive hypothesis. In the following, we use $p_1, p_2, q_1, q_2$ to refer to nodes, $a, b$ to refer to pairs of nodes, $\lambda$ to refer to any equivalence class of 2-tuples (i.e. pairs) of nodes, and $\mu$ to refer to any equivalence class of 4-tuples of nodes.

$\forall a = (p_1, p_2), b = (q_1, q_2) \in [n]^2$, let $\mathcal{E}(a, b)$ denote the equivalence class of 4-tuples containing $(p_1, p_2, q_1, q_2)$, and let $\mathcal{E}(b)$ represent the equivalence class of 2-tuples containing $(q_1, q_2)$. Two 4-tuples $(u, v, w, x), (u', v', w', x')$ are considered equivalent if $\exists \pi \in S_n$ such that $\pi(u) = u', \pi(v) = v', \pi(w) = w', \pi(x) = x'$. Equivalence between 2-tuples is defined similarly.

By equation 9(b) in [17], using the notations of $T, B, C, w, \beta$ defined there, $L$ is described by, given $A$ as the input and $b$ as the subscript index on the output,

$$L(A)_b = \sum_{a = (p_1, p_2) = (1, 1)}^{(n, n)} T_{a,b} A_a + Y_b = \sum_{a, \mu} w_\mu B^\mu_{a,b} A_a + \sum_\lambda \beta_\lambda C^\lambda_b = \sum_\mu \Big(\sum_{a \in [n]^2,\; (a,b) \in \mu} A_a\Big) w_\mu + \beta_{\mathcal{E}(b)}. \tag{3}$$

First, let

$$S_{b\mu} = \sum_{a \in [n]^2,\; (a,b) \in \mu} A_a.$$

By the inductive hypothesis,

$$S_{b\mu} = \sum_{\substack{(a,b) \in \mu \\ a \in E}} A_a + \sum_{\substack{(a,b) \in \mu \\ a \in N}} A_a + \sum_{\substack{(a,b) \in \mu \\ a \in S}} A_a = \sum_{\substack{(a,b) \in \mu \\ a \in E}} \xi_1 + \sum_{\substack{(a,b) \in \mu \\ a \in N}} \xi_2 + \sum_{\substack{(a,b) \in \mu \\ a \in S}} \xi_3 = m_E(b, \mu)\,\xi_1 + m_N(b, \mu)\,\xi_2 + m_S(b, \mu)\,\xi_3, \tag{4}$$

where $m_E(b, \mu)$ is defined as the total number of distinct $a \in [n]^2$ that satisfy $(a, b) \in \mu$ and $a \in E$, and similarly for $m_N(b, \mu)$ and $m_S(b, \mu)$. Formally, for example, $m_E(b, \mu) = \mathrm{card}\{a \in [n]^2 : (a, b) \in \mu, a \in E\}$.

Since $E \cup N \cup S = [n]^2$, $b$ belongs to one of $E$, $N$ and $S$. Thus, let $\tau(b) = E$ if $b \in E$, $\tau(b) = N$ if $b \in N$ and $\tau(b) = S$ if $b \in S$. It turns out that if $A$ is the adjacency matrix of an undirected regular graph with degree $d$, then $m_E(b, \mu), m_N(b, \mu), m_S(b, \mu)$ can instead be written (with an abuse of notation) as $m_E(\tau(b), \mu), m_N(\tau(b), \mu), m_S(\tau(b), \mu)$, meaning that for a fixed $\mu$, the values of $m_E$, $m_N$ and $m_S$ only depend on which of the three sets ($E$, $N$ or $S$) $b$ is in, and changing $b$ to a different member of the set $\tau(b)$ does not change the three numbers. In fact, for each $\tau(b)$ and $\mu$, the three numbers can be computed as functions of $n$ and $d$ using simple combinatorics, and their values are given in Tables 2, 3 and 4. An illustration of these numbers is given in Figure 4.

Therefore, combining (3) and (4), we have $L(A)_b = \sum_\mu w_\mu \big(m_E(\tau(b), \mu)\,\xi_1 + m_N(\tau(b), \mu)\,\xi_2 + m_S(\tau(b), \mu)\,\xi_3\big) + \beta_{\mathcal{E}(b)}$. Moreover, notice that $\tau(b)$ determines $\mathcal{E}(b)$: if $\tau(b) = E$ or $N$, then $\mathcal{E}(b) = \mathcal{E}(1, 2)$; if $\tau(b) = S$, then $\mathcal{E}(b) = \mathcal{E}(1, 1)$. Hence, we can write $\beta_{\tau(b)}$ instead of $\beta_{\mathcal{E}(b)}$ without loss of generality. In particular, this means that $L(A)_b = L(A)_{b'}$ if $\tau(b) = \tau(b')$. Therefore, $L(A)_b = \bar{\xi}_1 \mathbb{1}_{b \in E} + \bar{\xi}_2 \mathbb{1}_{b \in N} + \bar{\xi}_3 \mathbb{1}_{b \in S}$, where

$$\bar{\xi}_1 = \sum_\mu w_\mu \big(m_E(E, \mu)\,\xi_1 + m_N(E, \mu)\,\xi_2 + m_S(E, \mu)\,\xi_3\big) + \beta_E,$$
$$\bar{\xi}_2 = \sum_\mu w_\mu \big(m_E(N, \mu)\,\xi_1 + m_N(N, \mu)\,\xi_2 + m_S(N, \mu)\,\xi_3\big) + \beta_N,$$
$$\bar{\xi}_3 = \sum_\mu w_\mu \big(m_E(S, \mu)\,\xi_1 + m_N(S, \mu)\,\xi_2 + m_S(S, \mu)\,\xi_3\big) + \beta_S.$$

Similarly, $L(A')_b = \bar{\xi}'_1 \mathbb{1}_{b \in E'} + \bar{\xi}'_2 \mathbb{1}_{b \in N'} + \bar{\xi}'_3 \mathbb{1}_{b \in S'}$. But importantly, for every equivalence class of 4-tuples $\mu$ and $\forall \lambda_1, \lambda_2 \in \{E, N, S\}$, $m_{\lambda_1}(\lambda_2, \mu) = m'_{\lambda_1}(\lambda_2, \mu)$, as both of them can be obtained from the same entry of the same table. Therefore, $\bar{\xi}_1 = \bar{\xi}'_1$, $\bar{\xi}_2 = \bar{\xi}'_2$, $\bar{\xi}_3 = \bar{\xi}'_3$.

Finally, let $\xi^*_1 = \sigma(\bar{\xi}_1)$, $\xi^*_2 = \sigma(\bar{\xi}_2)$, and $\xi^*_3 = \sigma(\bar{\xi}_3)$. Then $\sigma(L(A))_b = \xi^*_1 \mathbb{1}_{b \in E} + \xi^*_2 \mathbb{1}_{b \in N} + \xi^*_3 \mathbb{1}_{b \in S}$, and $\sigma(L(A'))_b = \xi^*_1 \mathbb{1}_{b \in E'} + \xi^*_2 \mathbb{1}_{b \in N'} + \xi^*_3 \mathbb{1}_{b \in S'}$, as desired.

Figure 4: Illustration of the counts $m_E(E, \mu)$ for several equivalence classes $\mu$ of 4-tuples, shown on two graphs. In either graph, twice the total number of black edges equals one of these counts (it is twice because each undirected edge corresponds to two pairs $(p_1, p_2)$ and $(p_2, p_1)$, which combined with $(q_1, q_2)$ both belong to the same class); the total number of red edges equals two further counts, and the total number of green edges, which is the same, equals two more.
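Lemma 5 can be sanity-checked numerically on a small pair of regular graphs. The following is a minimal sketch under stated assumptions, not the construction of [17]: `equivariant_layer` uses only a hand-picked subset of permutation-equivariant linear operations (identity, transpose, diagonal, row and column sums, total sum, trace) plus the two bias terms, with random weights, and the graph pair (a hexagon and two disjoint triangles, both 2-regular on six nodes) is chosen purely for illustration. After a few layers the sketch checks that the output takes a single value on each of $E$, $N$ and $S$, and that these three values agree across the two graphs.

```python
import numpy as np


def cycle_adjacency(n):
    """Adjacency matrix of the cycle graph on n nodes (2-regular for n >= 3)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A


def equivariant_layer(A, w, beta):
    """ReLU applied to a linear permutation-equivariant map of an n x n matrix."""
    n = A.shape[0]
    ones, eye = np.ones((n, n)), np.eye(n)
    ops = [
        A,                                     # identity
        A.T,                                   # transpose
        np.diag(np.diag(A)),                   # keep only the diagonal
        A.sum(axis=1, keepdims=True) * ones,   # row sums, broadcast along each row
        A.sum(axis=0, keepdims=True) * ones,   # column sums, broadcast along each column
        np.diag(A.sum(axis=1)),                # row sums placed on the diagonal
        A.sum() * ones,                        # total sum in every entry
        np.trace(A) * eye,                     # trace placed on the diagonal
    ]
    out = sum(wi * op for wi, op in zip(w, ops))
    out = out + beta[0] * ones + beta[1] * eye  # bias on all entries / diagonal
    return np.maximum(out, 0.0)


def run_network(G, weights):
    """Apply the layers in sequence, starting from the adjacency matrix."""
    A = G.astype(float)
    for w, beta in weights:
        A = equivariant_layer(A, w, beta)
    return A


def class_values(X, G, tol=1e-9):
    """Return the single value X takes on E, N and S of graph G (as in Lemma 5)."""
    n = G.shape[0]
    S = np.eye(n, dtype=bool)
    E = (G == 1) & ~S
    N = ~E & ~S
    values = []
    for mask in (E, N, S):
        entries = X[mask]
        spread = entries.max() - entries.min()
        # Lemma 5 predicts one value per class; fail loudly if that is violated.
        assert spread <= tol * (1.0 + np.abs(entries).max())
        values.append(float(entries.mean()))
    return values


rng = np.random.default_rng(0)
weights = [(0.1 * rng.normal(size=8), rng.normal(size=2)) for _ in range(3)]

hexagon = cycle_adjacency(6)
triangles = np.kron(np.eye(2), cycle_adjacency(3))

vals_hex = class_values(run_network(hexagon, weights), hexagon)
vals_tri = class_values(run_network(triangles, weights), triangles)

if __name__ == "__main__":
    print("hexagon   (E, N, S):", vals_hex)
    print("triangles (E, N, S):", vals_tri)
    print("identical:", np.allclose(vals_hex, vals_tri))
```

Because only a subset of the equivariant basis is used, this checks an instance of the lemma rather than the full statement; the argument above covers every equivariant linear layer.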
Since $h$ is an invariant function, $h$ acting on $A^{(L)}$ essentially computes the sum of all the diagonal terms (i.e., for $b \in S$) and the sum of all the off-diagonal terms (i.e., for $b \in E \cup N$) of $A^{(L)}$ separately and then adds the two sums with two weights. If $G, G'$ are regular graphs with the same degree, then $|E| = |E'|$, $|S| = |S'|$ and $|N| = |N'|$. Therefore, by the lemma, $h(A^{(L)}) = h(A'^{(L)})$, and as a consequence $m(h(A^{(L)})) = m(h(A'^{(L)}))$.

µ m E ( E, µ ) m E ( N, µ ) m E ( S, µ )
(1, 2, 3, 4) ( n − d + 2 ( n − d d − d d − d d − d d − d ( n − d
(1, 1, 2, 2) 0 0 0
(1, 2, 2, 2) 0 0 d
(1, 2, 1, 1) 0 0 d
(1, 1, 1, 1) 0 0 0
Total nd nd nd

Table 2: $m_E$

µ m N ( E, µ ) m N ( N, µ ) m N ( S, µ )
(1, 2, 3, 4) ( n − n − d − 1) ( n − n − d − 1) + 2 n − d − n − d − n − d − n − d − n − d − n − d − n − d − n − d − ( n − n − d −
(1, 1, 2, 2) 0 0 0
(1, 2, 2, 2) 0 0 n − d − 1
(1, 2, 1, 1) 0 0 n − d − 1
(1, 1, 1, 1) 0 0 0
Total n(n − d − 1) n(n − d − 1) n(n − d − 1)

Table 3: $m_N$

µ m S ( E, µ ) m S ( N, µ ) m S ( S, µ )
(1, 2, 3, 4) 0 0 0
(1, 1, 2, 3) n − n − n −
(1, 2, 2, 2) 0 0 0
(1, 2, 1, 1) 0 0 0
(1, 1, 1, 1) 0 0 1
Total n n n

Table 4: $m_S$

E Specific GNN Architectures

In Section 6, we show experiments on synthetic and real datasets with several related architectures. Here are some explanations for them.

• sGNN-$i$: sGNNs with operators from the family $\{I, D, \min(A^{2^0}, 1), \ldots, \min(A^{2^{i-1}}, 1)\}$, $i \in \{1, 2, 5\}$. In our experiments, the sGNN models have 5 layers and hidden layer dimension (i.e. $d_k$) 64. They are trained using the Adam optimizer with learning rate 0.01.
• LGNN: Line Graph Neural Networks proposed by [5]. In our experiments, the LGNN models have 5 layers and hidden layer dimension (i.e. $d_k$) 64. They are trained using the Adam optimizer with learning rate 0.01.
• GIN: Graph Isomorphism Network by [30]. We took their performance results on the IMDB datasets reported in [30], and their performance results on the Circular Skip Link graphs experiments reported in [20].
• RP-GIN: Graph Isomorphism Network combined with Relational Pooling by [20]. We took the results reported in [20] for the Circular Skip Link graphs experiment.
• Order-2 Graph G-invariant Networks: G-invariant networks based on [17] and [18], as implemented in https://github.com/Haggaim/InvariantGraphNetworks.
• Ring-GNN: As defined in the main text. The architecture (number of hidden layers, feature dimensions) is taken to be the same as the Order-2 Graph G-invariant Networks. For the experiments on the IMDB datasets, each $k_1^{(t)}$ is initialized independently under $\mathcal{N}(0, \ldots)$, and each $k_2^{(t)}$ is initialized independently under $\mathcal{N}(0, \ldots)$. They are trained using the Adam optimizer with learning rate 0.00001. The initialization of $k_2^{(t)}$ and the learning rate were manually tuned, following the heuristic that Ring-GNN reduces to Order-2 Graph G-invariant Networks when $k_2^{(t)} = 0$, and that since Ring-GNN adds more operators, a smaller learning rate is likely more appropriate.
• Ring-GNN-SVD: Compared with the above Ring-GNN model, a Singular Value Decomposition layer is added between the Ring layers and the fully-connected layers. The SVD layer takes as input batch-size × channels matrices and as output batch-size × channels × [...] $\{\ldots\}$ and the model is trained using the Adam optimizer with learning rate 0.001 for 350 epochs. For the CSL dataset, the Ring layers have numbers of channels in $\{\ldots\}$ and the model is trained using the Adam optimizer with learning rate 0.001 for 1000 epochs. In both cases, each $k_1^{(t)}$ is initialized independently under $\mathcal{N}(0, \ldots)$ and each $k_2^{(t)}$ is initialized independently under $\mathcal{N}(0, \ldots)$. Since back-propagation through the SVD often becomes ill-conditioned, we clip gradient values when training. Moreover, from the perspective of computational resources, Nvidia V100 and P40 are much more numerically robust than 1080Ti and CPU in this task.

For the experiments with Circular Skip Link graphs, each model is trained and evaluated using 5-fold cross-validation. For Ring-GNN, in particular, we performed training ++