Universal Equivariant Multilayer Perceptrons
Siamak Ravanbakhsh
Abstract
Group invariant and equivariant Multilayer Perceptrons (MLP), also known as Equivariant Networks, have achieved remarkable success in learning on a variety of data structures, such as sequences, images, sets, and graphs. Using tools from group theory, this paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. In particular, it is shown that having a hidden layer on which the group acts regularly is sufficient for universal equivariance (invariance). A corollary is unconditional universality of equivariant MLPs for Abelian groups, such as CNNs with a single hidden layer. A second corollary is the universality of equivariant MLPs with a high-order hidden layer, where we give both group-agnostic bounds and means for calculating group-specific bounds on the order of the hidden layer that guarantees universal equivariance (invariance).
1. Introduction
Invariance and equivariance properties constrain the output of a function under various transformations of its input. This constraint serves as a strong learning bias that has proven useful in sample-efficient learning for a wide range of structured data. In this work, we are interested in universality results for Multilayer Perceptrons (MLPs) that are constrained to be equivariant or invariant. This type of result guarantees that the model can approximate any continuous equivariant (invariant) function with an arbitrary precision, in the same way an unconstrained MLP can approximate an arbitrary continuous function (Hornik et al., 1989; Cybenko, 1989; Funahashi, 1989).

School of Computer Science, McGill University, Montreal, Canada; Mila - Quebec AI Institute. Correspondence to: Siamak Ravanbakhsh <[email protected]>. Proceedings of the International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

The study of invariance in neural networks goes back to the book of Perceptrons (Minsky & Papert, 2017), where the necessity of parameter-sharing for invariance was used to prove the limitation of a single-layer Perceptron. Follow-up work showed how parameter symmetries can be used to achieve invariance to finite and infinite groups (Shawe-Taylor, 1989; Wood & Shawe-Taylor, 1996; Shawe-Taylor, 1993; Wood, 1996). These fundamental early works went unnoticed during the resurgence of neural network research and renewed attention to symmetry (Hinton et al., 2011; Mallat, 2012; Bruna & Mallat, 2013; Gens & Domingos, 2014; Jaderberg et al., 2015; Dieleman et al., 2016; Cohen & Welling, 2016a). When equivariance constraints are imposed on feed-forward layers in an MLP, the linear maps in each layer are constrained to use tied parameters (Wood & Shawe-Taylor, 1996; Ravanbakhsh et al., 2017b).
This model, which we call an equivariant MLP, appears in deep learning with sets (Zaheer et al., 2017; Qi et al., 2017), exchangeable tensors (Hartford et al., 2018), graphs (Maron et al., 2018), and relational data (Graham & Ravanbakhsh, 2019). Universality results for some of these models exist (Zaheer et al., 2017; Segol & Lipman, 2019; Keriven & Peyré, 2019). Broader results for high-order invariant
MLPs appear in (Maron et al., 2019); see also (Yarotsky, 2018).

A parallel line of work in equivariant deep learning studies linear actions of a group beyond permutations. The resulting equivariant linear layers can be written using convolution operations (Cohen & Welling, 2016b; Kondor & Trivedi, 2018). When limited to permutation groups, group convolution is simply another expression of parameter-sharing (Ravanbakhsh et al., 2017b); see also Section 2.3. However, in working with linear representations, one may move beyond finite groups (Cohen et al., 2019a); see also (Wood & Shawe-Taylor, 1996). Some applications include equivariance to isometries of the Euclidean space (Weiler & Cesa, 2019; Worrall et al., 2017) and the sphere (Cohen et al., 2018). An extension of this view to manifolds is proposed in (Cohen et al., 2019b). Finally, a third line of work in equivariant deep learning that involves a specialized architecture and learning procedure is that of Capsule networks (Sabour et al., 2017; Hinton et al., 2018); see (Lenssen et al., 2018) for a group-theoretic generalization.

This paper proves universality of equivariant MLPs for finite groups in several settings. Our main theorems show that any equivariant MLP with a single regular hidden layer is universal equivariant (invariant). This has two corollaries: 1) unconditional universality for Abelian groups, including a two-layer CNN; 2) universality of equivariant MLPs with a high-order hidden layer, which subsumes existing universality results for high-order networks. More specifically, we prove that a high-order hidden layer with an order of $\log_2(|H|)$, where $H$ is the stabilizer group, is universal equivariant (invariant). Using the largest possible stabilizer on a set of size $N$, this leads to a bound smaller than $N \log_2(N)$ for universal equivariance to an arbitrary permutation group.
This bound is an improvement over the previous bound $N(N-1)/2$ that was shown to guarantee universal "invariance" (Maron et al., 2019). The second part of the paper more closely examines product spaces by decomposing them using Burnside's table of marks. Using this tool, we arrive at the same group-agnostic bounds above, as well as potentially better group-specific bounds for high-order hidden layers. For example, it is shown that equivariant (hyper-)graph networks are universal for a hidden layer of order $N$.
2. Preliminaries
Let $G = \{g\}$ be a finite group. We define the action of this group on two finite sets $N$ and $M$ of input and output units in a feedforward layer. Using these actions, which define permutation groups, we then define equivariance and invariance. In detail, a $G$-action on the set $N$ is a structure-preserving map (homomorphism) $a: G \to S_N$ into the symmetric group $S_N$, the group of all permutations of $N$. The image of this map is a permutation group $G_N \leq S_N$. Instead of writing $[a(g)](n)$ for $g \in G$ and $n \in N$, we use the short notation $g \cdot n \doteq [a(g^{-1})](n)$ to denote this action. (Footnote: Using $g^{-1}$ instead of $g$ makes this a right action despite appearing on the left-hand side of $n$.) Let $M$ be another $G$-set, where the corresponding permutation action $G_M \leq S_M$ is defined by $b: G \to S_M$. The $G$-action on $N$ naturally extends to $x \in \mathbb{R}^{N}$ by $g \cdot x_n \doteq x_{g \cdot n}\ \forall g \in G$. More conveniently, we also write this action as $A_g x$, where $A_g$ is the permutation matrix form of $a(g, \cdot): N \to N$. Let the real matrix $W \in \mathbb{R}^{|M| \times |N|}$ denote a linear map $W: \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$. We say this map is $G$-equivariant iff
$$B_g W x = W A_g x \quad \forall x \in \mathbb{R}^{N},\ g \in G, \tag{1}$$
where, similar to $A_g$, the permutation matrix $B_g$ is defined based on the action $b(\cdot, g): M \to M$. In this definition, we assume that the group action on the input is faithful, that is, $a$ is injective, or $G_N \cong G$. If the action on the output index set $M$ is not faithful, then the kernel of this action is a non-trivial normal subgroup of $G$, $\ker(b) \triangleleft G$.
In this case $G_M \cong G/\ker(b)$ is a quotient group, and it is more accurate to say that $W$ is invariant to $\ker(b)$ and equivariant to $G/\ker(b)$. Using this convention, $G$-equivariance and $G$-invariance correspond to the extreme cases $\ker(b) = \{e\}$ and $\ker(b) = G$. Moreover, the composition of such invariant-equivariant functions preserves this property, motivating the design of deep networks by stacking equivariant layers.

$G_N$ partitions $N$ into orbits $N_1, \ldots, N_O$, where $G_N$ is transitive on each orbit, meaning that for each pair $n, n' \in N_o$ there is at least one $g \in G_N$ such that $g \cdot n = n'$. If $G_N$ has a single orbit, it is transitive, and $N$ is called a homogeneous space for $G$. If, moreover, the choice of $g \in G_N$ with $g \cdot n = n'$ is unique, then $G_N$ is called regular.

Given a subgroup $H \leq G$ and $g \in G$, the right coset of $H$ in $G$, defined as $Hg \doteq \{hg \mid h \in H\}$, is a subset of $G$. For a fixed $H \leq G$, the set of these right cosets, $H\backslash G = \{Hg \mid g \in G\}$, forms a partition of $G$. $G$ naturally acts on the right coset space, where $g' \cdot (Hg) \doteq H(gg')$ sends one coset to another. The significance of this action is that "any" transitive $G$-action is isomorphic to the $G$-action on some right coset space. To see why, note that in this action any $h \in H$ stabilizes the coset $He$, because $h \cdot He = H(eh) = He$. Therefore, in any action, the stabilizer identifies the coset space.
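These notions can be checked by brute force for small groups. The following sketch (our own illustrative code, not from the paper) represents group elements as permutation tuples, with $g \cdot n$ taken as `g[n]`, and tests transitivity and regularity:

```python
from itertools import permutations

# A finite permutation group as a list of tuples acting on points 0..n-1;
# the action g . n is g[n].

def orbits(group, points):
    """Partition `points` into orbits of the permutation group."""
    seen, out = set(), []
    for p in points:
        if p in seen:
            continue
        orb = {g[p] for g in group}
        seen |= orb
        out.append(orb)
    return out

def is_regular(group, points):
    """Regular = transitive, and only the identity stabilizes a point."""
    if len(orbits(group, points)) != 1:
        return False
    n0 = points[0]
    return sum(1 for g in group if g[n0] == n0) == 1

# Cyclic group C4 (rotations of 4 points): transitive and regular.
c4 = [tuple((i + r) % 4 for i in range(4)) for r in range(4)]
# Symmetric group S3 on 3 points: transitive but not regular,
# since the stabilizer of a point has two elements.
s3 = list(permutations(range(3)))

assert is_regular(c4, list(range(4)))
assert not is_regular(s3, list(range(3)))
```

Regularity of a hidden layer's index set is exactly the condition invoked by the universality theorems of Section 3.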
Consider the equivariance condition (1). Since the equality holds for all $x \in \mathbb{R}^{N}$, and using the fact that the inverse of a permutation matrix is its transpose, the equivariance constraint reduces to
$$B_g W A_g^\top = W \quad \forall g \in G. \tag{2}$$
The equation above ties the parameters within the orbits of the $G$-action on the rows and columns of $W$:
$$W(m, n) = W(g \cdot m, g \cdot n) \quad \forall g \in G,\ (m, n) \in M \times N, \tag{3}$$
where $W(g \cdot m, g \cdot n)$ is an element of the matrix $W$. This type of group action on a Cartesian product space is sometimes called the diagonal action; here, the action is on the Cartesian product of the rows and columns of $W$.

More generally, when $G$ acts on the coset $Ha \in H\backslash G$, all $g \in a^{-1}Ha$ stabilize $Ha$: since $g = a^{-1}ha$ for some $h \in H$, we have $(a^{-1}ha) \cdot Ha = H(a\,a^{-1}ha) = Ha$. This means that any transitive $G$-action on a set $N$ may be identified with the stabilizer subgroup $G_n \doteq \{g \in G \mid g \cdot n = n\}$, for a choice of $n \in N$. This gives a bijection between $N$ and the right coset space $G_n\backslash G$. We saw that any homogeneous $G$-space is isomorphic to a coset space.
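The parameter tying of (3) can be realized directly by enumerating the orbits of the diagonal action on row-column pairs; a sketch for $G = C_3$ acting on both the rows and the columns (illustrative code, not the paper's):

```python
import numpy as np

# One free parameter per orbit of the diagonal action g.(m, n) = (g.m, g.n)
# gives the parameter-sharing pattern of Eq. (3). Here the input and output
# index sets are both Z_3 with G = C_3 acting by rotation.
n = 3
group = [tuple((i + r) % n for i in range(n)) for r in range(n)]

def perm_matrix(g):
    P = np.zeros((len(g), len(g)))
    for i, j in enumerate(g):
        P[j, i] = 1.0          # P e_i = e_{g(i)}
    return P

# Orbits of the diagonal action on index pairs.
orbit_id, orbits = {}, []
for m in range(n):
    for k in range(n):
        if (m, k) in orbit_id:
            continue
        orb = {(g[m], g[k]) for g in group}
        for p in orb:
            orbit_id[p] = len(orbits)
        orbits.append(orb)

# A tied weight matrix W with one free parameter per orbit.
rng = np.random.default_rng(0)
theta = rng.normal(size=len(orbits))
W = np.array([[theta[orbit_id[(m, k)]] for k in range(n)] for m in range(n)])

# The equivariance constraint (2): B_g W A_g^T = W for all g.
for g in group:
    A = B = perm_matrix(g)     # same action on rows and columns here
    assert np.allclose(B @ W @ A.T, W)
print(len(orbits))             # 3 orbits: W is circulant with 3 parameters
```

For the cyclic group this recovers a circulant matrix, one free parameter per diagonal, which is the familiar weight sharing of convolution.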
Using $N \cong H\backslash G$ and $M \cong K\backslash G$, the parameter-sharing constraint of (2) becomes
$$W(Kg, Hg') = W(g^{-1} \cdot Kg,\ g^{-1} \cdot Hg') \tag{4}$$
$$= W(K, Hg'g^{-1}) \quad \forall g, g' \in G. \tag{5}$$
Since we can always multiply both indices to have the coset $K$ as the first argument, we can replace the matrix $W$ with a vector $w$ such that $W(Kg, Hg') = w(Hg'g^{-1})\ \forall g, g' \in G$. This rewriting also enables us to express the matrix-vector multiplication of the linear map $W$ in the form of a cross-correlation of the input with a kernel $w$:
$$[Wx](n) = [Wx](Kg) \tag{6}$$
$$= \sum_{Hg' \in H\backslash G} W(Kg, Hg')\, x(Hg') \tag{7}$$
$$= \sum_{Hg' \in H\backslash G} w(Hg'g^{-1})\, x(Hg'). \tag{8}$$
This relates the parameter-sharing view of equivariant maps (4) to the convolution view (8). Therefore, the universality results in the following extend to group convolution layers (Cohen & Welling, 2016a; Cohen et al., 2019a) for finite groups.

Equivariant Affine Maps
We may extend our definition and consider affine $G$-maps $Wx + b$, by allowing an "invariant" bias parameter $b \in \mathbb{R}^{|M|}$ satisfying
$$B_g b = b. \tag{9}$$
This implies a parameter-sharing constraint $b(m) = b(g \cdot m)$. For a homogeneous $M$, this constraint enforces a scalar bias. Beyond homogeneous spaces, the number of free parameters in $b$ grows with the number of orbits.

One may stack multiple layers of equivariant affine maps with multiple channels, followed by a non-linearity, so as to build an equivariant MLP. One layer of this equivariant MLP, a.k.a. equivariant network, is given by
$$x^{(\ell)}_c = \sigma\Big(\sum_{c'=1}^{C^{(\ell-1)}} W^{(\ell)}_{c,c'}\, x^{(\ell-1)}_{c'} + b^{(\ell)}_c\Big),$$
where $1 \leq c' \leq C^{(\ell-1)}$ and $1 \leq c \leq C^{(\ell)}$ index the input and output channels respectively, and $x^{(\ell)}$ is the output of layer $1 \leq \ell \leq L$, with $x^{(0)} = x$ denoting the original input. Here, we assume that $G$ acts faithfully on all $x^{(\ell)}_c \in \mathbb{R}^{H^{(\ell)}}\ \forall c, \ell$, with $H^{(0)} = N$ and $H^{(L)} = M$. The parameter matrices $W^{(\ell)}_{c,c'} \in \mathbb{R}^{H^{(\ell)} \times H^{(\ell-1)}}$ and the bias vectors $b^{(\ell)}_c \in \mathbb{R}^{H^{(\ell)}}$ are constrained by the parameter-sharing conditions (2) and (9), respectively. In an invariant MLP, the faithfulness condition for the $G$-action on the hidden and output layers is lifted. In practice, it is common to construct invariant networks by first constructing an equivariant network followed by pooling over $H^{(L)}$.

Figure 1. The equivariant MLP of (16). The symbol $\curvearrowright$ indicates the $G$-action on the units; $W_c$ and $W'_c$ for all channels of the hidden layer $c = 1, \ldots, C$ are constrained by the parameter-sharing of (3). If the $G$-action on the hidden layer is regular, the number of channels can grow to approximate any continuous $G$-equivariant function with arbitrary accuracy. Bias terms are not shown.
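For $G = C_n$ (circular translations), each tied block $W_{c,c'}$ in the layer above is circulant, i.e., it computes the circular cross-correlation (8). A minimal multi-channel sketch (shapes and names are ours, not the paper's):

```python
import numpy as np

# One multi-channel equivariant layer for G = C_n, following the layer
# equation above: each W_{c,c'} is circulant (tied by Eq. (3)) and the
# bias is a single scalar per channel, as required by Eq. (9).
n, C_in, C_out = 6, 2, 3
rng = np.random.default_rng(2)

def circulant(kernel):
    n = len(kernel)
    return np.array([[kernel[(j - i) % n] for j in range(n)] for i in range(n)])

kernels = rng.normal(size=(C_out, C_in, n))   # free parameters
bias = rng.normal(size=C_out)                 # scalar bias per channel

def layer(x):                                 # x: (C_in, n)
    out = np.stack([
        sum(circulant(kernels[c, cp]) @ x[cp] for cp in range(C_in)) + bias[c]
        for c in range(C_out)])
    return np.maximum(out, 0.0)               # ReLU non-linearity

x = rng.normal(size=(C_in, n))
shift = 4
# Equivariance of the full layer: shifting every input channel shifts
# every output channel by the same amount.
assert np.allclose(layer(np.roll(x, shift, axis=1)),
                   np.roll(layer(x), shift, axis=1))
```

Because the bias is invariant and the non-linearity acts pointwise, equivariance of the affine map carries over to the whole layer, which is why such layers can be stacked freely.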
3. Universality Results
This section presents two new results on the universality of both invariant and equivariant networks with a single hidden layer ($L = 2$). Formally, we say that a $G$-equivariant MLP $\hat\psi: \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$ is a universal $G$-equivariant approximator if, for any $G$-equivariant continuous function $\psi: \mathbb{R}^{|N|} \to \mathbb{R}^{|M|}$, any compact set $K \subset \mathbb{R}^{|N|}$, and any $\epsilon > 0$, there exists a choice of parameters and number of channels such that $\|\psi(x) - \hat\psi(x)\| < \epsilon\ \forall x \in K$.

Theorem 3.1. A $G$-invariant network
$$\hat\psi(x) = \sum_{c=1}^{C} w'_c\, \mathbf{1}^\top \sigma\big(W_c x + \mathbf{1} b_c\big) \tag{10}$$
with a single hidden layer, on which $G$ acts regularly, is a universal $G$-invariant approximator. Here, $\mathbf{1} = \underbrace{[1, \ldots, 1]^\top}_{|G|}$ and $b_c, w'_c \in \mathbb{R}$.

Proof. The first step follows the symmetrization argument of (Yarotsky, 2018). Since the MLP is a universal approximator, for any compact set $K \subset \mathbb{R}^{|N|}$ we can find $\psi_{\mathrm{MLP}}$ such that for any $\epsilon > 0$, $|\psi(x) - \psi_{\mathrm{MLP}}(x)| \leq \epsilon$ for $x \in K$. Let $K_{\mathrm{sym}} = \{A_g x \mid x \in K, g \in G\}$ denote the symmetrized $K$, which is again a compact subset of $\mathbb{R}^{N}$ for finite $G$. Let $\psi_{\mathrm{MLP}+}$ approximate $\psi$ on the symmetrized compact set $K_{\mathrm{sym}}$. It is then easy to show that for $G$-invariant $\psi$, the symmetrized MLP $\psi_{\mathrm{sym}}(x) = \frac{1}{|G|}\sum_{g \in G} \psi_{\mathrm{MLP}+}(A_g x)$ also approximates $\psi$:
$$|\psi(x) - \psi_{\mathrm{sym}}(x)| = \Big|\psi(x) - \frac{1}{|G|}\sum_{g \in G} \psi_{\mathrm{MLP}+}(A_g x)\Big| \tag{11}$$
$$\leq \frac{1}{|G|}\sum_{g \in G} \big|\psi(A_g x) - \psi_{\mathrm{MLP}+}(A_g x)\big| \leq \epsilon. \tag{12}$$

The next step is to show that $\psi_{\mathrm{sym}}$ is equal to $\hat\psi$ of (10) for some parameters $W_c \in \mathbb{R}^{|H| \times |N|}$ constrained so that $H_g W_c = W_c A_g\ \forall g \in G$, where $A_g$ and $H_g$ are the permutation representations of the $G$-action on the input and the hidden layer, respectively:
$$\psi_{\mathrm{sym}}(x) = \frac{1}{|G|}\sum_{g \in G}\sum_{c=1}^{C} w'_c\, \sigma\big(w_c^\top (A_g x)\big) \tag{13}$$
$$= \sum_{c=1}^{C} \frac{w'_c}{|G|}\sum_{g \in G} \sigma\big((w_c^\top A_g)\, x\big) \tag{14}$$
$$= \sum_{c=1}^{C} \tilde w_c\, \mathbf{1}^\top \sigma\Bigg(\underbrace{\begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|H|}} \end{bmatrix}}_{W_c} x\Bigg), \tag{15}$$
where in the last step we put the summation terms into the rows of the matrix $W_c$ and performed the summation using multiplication by $\mathbf{1}^\top$; $\tilde w_c$ is the rescaled $w'_c$. Since the summation in (13) is over $g \in G$, each row of $W_c$, and therefore each hidden unit, is "attached" to exactly one group member, which translates to having a principal homogeneous space, a.k.a. a regular $G$-set. Note that we have the freedom to choose the rows in any order, corresponding to a different order of summation, which means that the choice of a particular principal homogeneous space is irrelevant.

Now we show that the parameter matrix $W_c \in \mathbb{R}^{|H| \times |N|}$ above satisfies the parameter-sharing constraint $W_c A_g = H_g W_c\ \forall g \in G$:
$$H_g W_c A_g^{-1} = \begin{bmatrix} w_c^\top A_{g_1 g} \\ \vdots \\ w_c^\top A_{g_{|H|} g} \end{bmatrix} A_g^{-1} = \begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|H|}} \end{bmatrix} = W_c,$$
where the first equality follows from the fact that the row indexed by $g_r$ is moved to the row $g \cdot g_r = g_r g^{-1}$: $H_g A_{g_r} = A_{g \cdot g_r} = A_{g_r g^{-1}}$.
Therefore, the current row $g_{r'}$ was previously $g^{-1} \cdot g_{r'} = g_{r'} g$. The second equality follows because $A_g^{-1}$ acts from the right and no further inversion is needed: $A_{g_r g} A_g^{-1} = A_{g_r g g^{-1}} = A_{g_r}$. This shows that a $G$-invariant network with a single hidden layer on which $G$ acts regularly is equivalent to a symmetrized MLP, and therefore, for some number of channels, it is a universal approximator of $G$-invariant functions.

This result should not be surprising, since the size of a regular hidden layer grows with the group; as is evident from the proof, an equivariant MLP with a regular hidden layer implicitly averages the output over all transformations of the input.
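Both halves of this proof can be verified numerically for a small cyclic group: the group-averaged MLP $\psi_{\mathrm{sym}}$ coincides with an invariant network of the form (10) whose first-layer rows are $w_c^\top A_g$. An illustrative sketch under these assumptions:

```python
import numpy as np

# Symmetrization as in the proof of Theorem 3.1 for G = C_n: averaging an
# arbitrary one-hidden-layer MLP over the group yields a G-invariant
# function, and stacking the rows w_c^T A_g into W_c realizes it as a
# network with one regular hidden layer (biases omitted, as in (13)).
n, C = 4, 3
group_mats = []
for r in range(n):
    P = np.zeros((n, n))
    for i in range(n):
        P[(i + r) % n, i] = 1.0
    group_mats.append(P)          # A_g for each rotation g

rng = np.random.default_rng(3)
w = rng.normal(size=(C, n))       # first-layer weights of the base MLP
w_out = rng.normal(size=C)        # second-layer weights

def psi_sym(x):
    """(1/|G|) sum_g psi_MLP(A_g x)."""
    return np.mean([w_out @ np.tanh(w @ (A @ x)) for A in group_mats])

def psi_hat(x):
    """Equivalent invariant network (10): rows of W_c are w_c^T A_g."""
    total = 0.0
    for c in range(C):
        W_c = np.stack([w[c] @ A for A in group_mats])   # |G| x |N|, regular
        total += (w_out[c] / n) * np.ones(n) @ np.tanh(W_c @ x)
    return total

x = rng.normal(size=n)
assert np.isclose(psi_sym(x), psi_hat(x))
# Invariance: the value is unchanged under any group transformation.
for A in group_mats:
    assert np.isclose(psi_sym(A @ x), psi_sym(x))
```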
Next, we apply a similar idea to prove the universality of the equivariant
MLPs with a regular hidden layer.
Theorem 3.2. A $G$-equivariant MLP
$$\hat\psi(x) = \sum_{c=1}^{C} W'_c\, \sigma\big(W_c x + b_c\big) \tag{16}$$
with a single regular hidden layer is a universal $G$-equivariant approximator.

Proof. In this setting, symmetrization, using the so-called
Reynolds operator (Sturmfels, 2008), for the universal MLP is given by
$$\psi_{\mathrm{sym}}(x) = \frac{1}{|G|}\sum_{g \in G} B_{g^{-1}} \sum_{c=1}^{C} w'_c\, \sigma\big(w_c^\top A_g x + b_c\big), \tag{17}$$
where $w_c \in \mathbb{R}^{|N|}$ and $w'_c \in \mathbb{R}^{|M|}$ are the weight vectors in the first and second layers associated with hidden unit $c$. Our objective is to show that this symmetrized MLP is equivalent to the equivariant network of (16), in which $W'_c \in \mathbb{R}^{|M| \times |H|}$ and $W_c \in \mathbb{R}^{|H| \times |N|}$ use parameter-sharing to satisfy
$$H_g W_c = W_c A_g \quad \text{and} \quad B_g W'_c = W'_c H_g \quad \forall g \in G. \tag{18}$$
Here, $A_g$, $B_g$ and $H_g$ are the permutation representations of the $G$-action on the input, the output, and the hidden layer, respectively.

First, rewrite the symmetrized MLP as
$$\psi_{\mathrm{sym}}(x) = \sum_{c=1}^{C}\sum_{g \in G} B_{g^{-1}} w'_c\, \sigma\big(w_c^\top A_g x + b_c\big) = \sum_{c=1}^{C} W'_c\, \sigma\big(W_c x + b_c\big),$$
where
$$W'_c = \begin{bmatrix} B_{g_1^{-1}} w'_c & \cdots & B_{g_{|G|}^{-1}} w'_c \end{bmatrix}, \qquad W_c = \begin{bmatrix} w_c^\top A_{g_1} \\ \vdots \\ w_c^\top A_{g_{|G|}} \end{bmatrix},$$
and the $\frac{1}{|G|}$ factor is absorbed into one of the weights. It remains to show that the two matrices above satisfy the equivariance conditions $H_g W_c = W_c A_g$ and $B_g W'_c = W'_c H_g$. The proof for $W_c$ is identical to the invariant network case. For $W'_c$, we use a similar approach:
$$B_g W'_c H_g^{-1} = \begin{bmatrix} B_g B_{g_1^{-1} g}\, w'_c & \cdots & B_g B_{g_{|G|}^{-1} g}\, w'_c \end{bmatrix} = \begin{bmatrix} B_{g_1^{-1}} w'_c & \cdots & B_{g_{|G|}^{-1}} w'_c \end{bmatrix} = W'_c.$$
In the first step, since $H_g^{-1} = H_{g^{-1}}$ acts on the right, it moves the column indexed by $g_l^{-1}$ to $g_l^{-1} g^{-1}$. This means that the column currently at $g_{l'}^{-1}$ was previously $g_{l'}^{-1} g$. The second step uses the following: $B_g B_{g_l^{-1} g} = B_{g \cdot (g_l^{-1} g)} = B_{g_l^{-1} g g^{-1}} = B_{g_l^{-1}}$. This proves the equality of the symmetrized MLP (17) to the equivariant MLP of (16). An argument similar to the proof of the invariant case shows the universality of $\psi_{\mathrm{sym}}$. Putting these together completes the proof of Theorem 3.2.

In the case where $G$ is an Abelian group, any faithful transitive action is regular, meaning that the hidden layer in a $G$-equivariant neural network is necessarily regular. Combined with Theorem 3.2, this leads to an unconditional universality result for Abelian groups.
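The matrices constructed in the proof of Theorem 3.2 can be checked numerically for $G = C_n$, with hidden units indexed by group elements and carrying the right-regular action (an illustrative sketch, not the paper's code):

```python
import numpy as np

# Construction from the proof of Theorem 3.2 for G = C_n: the rows of W_c
# are w_c^T A_g and the columns of W'_c are B_{g^{-1}} w'_c, with the
# hidden units indexed by group elements (a regular hidden layer, carrying
# the right-regular representation). Verifies the constraints (18).
n = 5
def rot(r):
    P = np.zeros((n, n))
    for i in range(n):
        P[(i + r) % n, i] = 1.0
    return P

A = B = [rot(r) for r in range(n)]        # action on input/output units
H = [rot((n - r) % n) for r in range(n)]  # right-regular action on hidden units

rng = np.random.default_rng(4)
w, w_out = rng.normal(size=n), rng.normal(size=n)

W  = np.stack([w @ A[g] for g in range(n)])                # |H| x |N|
Wp = np.stack([B[g].T @ w_out for g in range(n)], axis=1)  # |M| x |H|

for g in range(n):
    assert np.allclose(H[g] @ W, W @ A[g])     # H_g W_c = W_c A_g
    assert np.allclose(B[g] @ Wp, Wp @ H[g])   # B_g W'_c = W'_c H_g
```

Note that the hidden layer carries a different (right-regular) representation than the input, which is exactly what the row-reindexing argument in the proof tracks.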
Corollary 1.
For an Abelian group $G$, a $G$-equivariant (invariant) neural network with a single hidden layer is a universal approximator of continuous $G$-equivariant (invariant) functions on compact subsets of $\mathbb{R}^{|N|}$.

A corollary to this is the universality of a Convolutional Neural Network (CNN) with a single hidden layer.
Corollary 2 (Universality of CNNs). For arbitrary input-output dimensions, a CNN with a single hidden layer, full kernels, and cyclic padding is a universal approximator of continuous circular-translation equivariant (invariant) functions.
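The setting of Corollary 2 can be verified numerically: a full-kernel correlation with cyclic padding over $\mathbb{Z}_h \times \mathbb{Z}_w$ (the action of a product of two cyclic groups) commutes with circular translations. Sketch with illustrative names:

```python
import numpy as np

# A conv layer with a full kernel and cyclic padding is a circular
# cross-correlation over Z_h x Z_w, and hence equivariant to circular
# translations of the input.
rng = np.random.default_rng(5)
h, w = 4, 5
kernel = rng.normal(size=(h, w))    # "full" kernel: same size as the input
x = rng.normal(size=(h, w))

def circ_corr(x, k):
    """[Wx](s, t) = sum_{i,j} k(i - s, j - t) x(i, j), indices mod (h, w)."""
    out = np.zeros_like(x)
    for s in range(h):
        for t in range(w):
            out[s, t] = sum(k[(i - s) % h, (j - t) % w] * x[i, j]
                            for i in range(h) for j in range(w))
    return out

y = circ_corr(x, kernel)
dy, dx_ = 2, 3                      # an arbitrary circular translation
x_shift = np.roll(x, (dy, dx_), axis=(0, 1))
assert np.allclose(circ_corr(x_shift, kernel),
                   np.roll(y, (dy, dx_), axis=(0, 1)))
```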
Use of the term circular, both in padding and translation, is due to the need to work with finite translations, which are produced as the result of the action of a product of cyclic groups.

The $G$-action on the hidden units $H$ naturally extends to its simultaneous action on the Cartesian product $H^D = H \times \ldots \times H$:
$$g \cdot (h_1, \ldots, h_D) \doteq (g \cdot h_1, \ldots, g \cdot h_D).$$
We call this an order-$D$ product space. Product spaces are used in building high-order layers in $G$-equivariant networks in several recent works (Kondor et al., 2018; Maron et al., 2018; Keriven & Peyré, 2019; Albooyeh et al., 2019). Maron et al. (2019) show that for
$$D \geq \frac{|H|(|H|-1)}{2}, \tag{19}$$
such MLPs with multiple hidden layers of order $D$ become universal $G$-invariant approximators. In this section, we show that better bounds for $D$ that guarantee universal invariance and equivariance follow from the universality results of Theorems 3.1 and 3.2. The next section provides an in-depth analysis of product spaces that not only gives an alternative proof of the theorems below, but could also lead to yet better bounds.

Theorem 3.3.
Let $G$ act faithfully on $H \cong [H\backslash G]$. Then $H^D$ has a regular orbit for any $D \geq \log_2(|H|)$, and therefore, by Theorem 3.2, an order-$D$ hidden layer guarantees universal equivariance.

(Footnote: The input can be zero-padded before circular padding, so that Corollary 2 guarantees universal approximation of translation-equivariant functions, where translations are bounded by the size of the original input.)

(Footnote: The beautiful proof below was proposed by an anonymous reviewer. The original proof uses the ideas discussed in the next section and appears later in the paper.)

Proof. If $G$ acts faithfully on $H$, the intersection of the stabilizers of all the points in $H$ is trivial, i.e., $\mathrm{Core}_G(H) = \{e\}$. If, instead of taking the intersection of the stabilizers of all $h \in H$, we can take the intersection of the stabilizers of just $D$ (carefully chosen) points, we will know there is a regular orbit in $H^D$. That is because the stabilizer of a point in $H^D$ is the intersection of the stabilizers of its elements in $H$, that is, $\mathrm{Stab}_G(h_1, \ldots, h_D) = \bigcap_{d=1}^{D} \mathrm{Stab}_G(h_d)$. So the question is for what value of $D$ we can find $D$ points such that the intersection of their stabilizers is trivial. We work recursively to find a bound on $D$.

Start with just one point $h_1$ in $H$, and assume its stabilizer is of size $s_1$. Now assume we have a point $(h_1, \ldots, h_d)$ in $H^d$ such that its stabilizer is of size $s_d$. If $s_d = 1$, we are done. Otherwise, since the action is faithful, there has to exist a point $h_{d+1}$ such that the intersection of the stabilizers of $h_1, \ldots, h_{d+1}$ is a strictly smaller subgroup of the stabilizer of $(h_1, \ldots, h_d)$.
The size of a proper subgroup is at most half the size of the original group, and therefore $s_{d+1} \leq s_d/2$: each additional point at least halves the size of the stabilizer. It follows that for any $D \geq \log_2(|H|)$, $[H\backslash G]^D = H^D$ has an orbit with a trivial stabilizer.

Since the largest stabilizer for any action on $H$ is $S_{|H|-1}$, we can use a lower bound for $D$ in Theorem 3.3 that is independent of the stabilizer subgroup $H$. The following bound follows from Stirling's approximation of $N!$.

Corollary 3. The high-order $G$-set of hidden units $H^D$, with $N = |H|$, has a regular orbit for
$$D \geq \big\lceil (N - \tfrac{1}{2}) \log_2(N-1) - (N-2)\log_2(e) \big\rceil,$$
and following Theorem 3.2 the corresponding equivariant MLP is a universal approximator of continuous $G$-equivariant functions.

4. Decomposition of Product $G$-Sets

A prerequisite to the analysis of product $G$-sets is their classification, which also leads to a classification of all $G$-maps based on their input/output $G$-sets.

$G$-Sets and $G$-Maps. Recall that any transitive $G$-set $N$ is isomorphic to a right-coset space $H\backslash G$. However, the right-coset spaces $H\backslash G$ and $(g^{-1}Hg)\backslash G\ \forall g \in G$ are themselves isomorphic. This means that what we care about is conjugacy classes of subgroups $[H] = \{g^{-1}Hg \mid g \in G\}$, which classify right-coset spaces up to conjugacy: $[H\backslash G] = \{(g^{-1}Hg)\backslash G \mid g \in G\}$.
We use brackets to identify the conjugacy class. In this notation, for $H, H' \leq G$, we say $[H] < [H']$ iff $g^{-1}Hg < H'$ for some $g \in G$.

A $G$-set is transitive on each of its orbits, and we can identify each orbit with its stabilizer subgroup. (Footnote: The stabilizer subgroups of two points in a homogeneous space are conjugate, and therefore $G$-sets resulting from conjugate choices of right cosets are isomorphic. To see why stabilizers are conjugate, assume $n' = a^{-1} \cdot n$ and $h \in G_n$; then $(a^{-1}ha) \cdot n' = n'$, so $a^{-1}ha \in G_{n'}$. Since conjugation is a bijection, this means $G_{n'} = a^{-1} G_n a$.) Therefore, a list of these subgroups along with their multiplicities completely defines a $G$-set up to isomorphism (Rotman, 2012):
$$N \cong \bigcup_{[H_i] \leq G} p_i\, [H_i\backslash G], \tag{20}$$
where $p_1, \ldots, p_I \in \mathbb{Z}_{\geq 0}$ denote the multiplicities of the right-coset spaces, and $N$ has $\sum_{i=1}^{I} p_i$ orbits.

To ensure a faithful $G$-action on $N$, a necessary and sufficient condition is for the point-stabilizers $G_n\ \forall n \in N$ to have a trivial intersection. The point-stabilizers within each orbit are conjugate to each other, and their intersection, which is the largest normal subgroup of $G$ contained in $H_i$, is called the core of the $G$-action on $[H_i\backslash G]$:
$$\mathrm{Core}_G(H_i) \doteq \bigcap_{g \in G} g^{-1} H_i g. \tag{21}$$
Next, we extend the classification of $G$-sets to $G$-equivariant maps, a.k.a. $G$-maps $W: \mathbb{R}^{N} \to \mathbb{R}^{M}$, by jointly classifying the input and output index sets $N$ and $M$.
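The core (21) and the faithfulness condition can be computed by brute force; a sketch for $G = S_3$ (our own illustrative code):

```python
from itertools import permutations

# Core_G(H) = intersection of all conjugates of H: the kernel of the
# G-action on the coset space H\G. The action is faithful iff the core
# is trivial. Permutations are tuples; (g*h)[i] = g[h[i]].
def compose(g, h):
    return tuple(g[i] for i in h)

def inverse(g):
    inv = [0] * len(g)
    for i, j in enumerate(g):
        inv[j] = i
    return tuple(inv)

G = list(permutations(range(3)))
e = (0, 1, 2)

def core(H):
    out = set(G)
    for g in G:
        out &= {compose(compose(inverse(g), h), g) for h in H}
    return out

S2 = {e, (1, 0, 2)}              # stabilizer of the point 2: a copy of S_2
A3 = {e, (1, 2, 0), (2, 0, 1)}   # the alternating group A_3, normal in S_3

assert core(S2) == {e}           # faithful action on the three cosets S2\G
assert core(A3) == A3            # A_3 normal: the action on A3\G kills A_3
```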
We may consider a similar expression to (20) for the output index set: $M = \bigcup_{[K_j] \leq G} q_j\, [K_j\backslash G]$. The linear $G$-map $W: \mathbb{R}^{N} \to \mathbb{R}^{M}$ is then equivariant to $G/K$ and invariant to $K \triangleleft G$ iff
$$\bigcap_{p_i > 0} \mathrm{Core}_G(H_i) = \{e\} \quad \text{and} \quad \bigcap_{q_j > 0} \mathrm{Core}_G(K_j) = K, \tag{22}$$
where the second condition translates to $K$-invariance of the $G$-action on $M$. Note that the first condition simply ensures the faithfulness of the $G$-action on $N$. This result means that the multiplicities $(p_1, \ldots, p_I)$ and $(q_1, \ldots, q_J)$ completely identify a (linear) $G$-map $W: \mathbb{R}^{N} \to \mathbb{R}^{M}$ that is equivariant to $G/K$ and invariant to $K \triangleleft G$, up to isomorphism.

Product $G$-Sets. Previously, we classified all $G$-sets as disjoint unions of homogeneous spaces $\bigcup_{i=1}^{I} p_i\, [G_i\backslash G]$, where $G$ acts transitively on each orbit. However, as we saw earlier, $G$ also naturally acts on the Cartesian product of homogeneous $G$-sets, $N_1 \times \ldots \times N_D = (G_1\backslash G) \times \ldots \times (G_D\backslash G)$, where the action is defined by
$$g \cdot (G_1 h_1, \ldots, G_D h_D) \doteq (G_1(h_1 g), \ldots, G_D(h_D g)).$$
A special case is the repeated self-product of the same homogeneous space $H \cong [H\backslash G]$, which gives an order-$D$ product space:
$$H^D \cong [H\backslash G]^D = \underbrace{[H\backslash G] \times \ldots \times [H\backslash G]}_{D \text{ times}}.$$
The following discussion shows how the product space decomposes into orbits, where the existence of a regular orbit leads to universality.
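For the symmetric group, this orbit decomposition can be enumerated by brute force; the sketch below (our own code) decomposes the diagonal $S_4$ action on the order-4 product space, recovering the Bell/Stirling counts and exhibiting regular orbits:

```python
from collections import Counter
from itertools import permutations, product

# Brute-force orbit decomposition of the diagonal S_N action on the
# order-D product space N^D, for N = D = 4. Orbits are indexed by set
# partitions of the D positions, and orbits whose points have at least
# N - 1 distinct entries are regular (trivial stabilizer, size |S_4| = 24).
N, D = 4, 4
G = list(permutations(range(N)))

orbits = set()
for point in product(range(N), repeat=D):
    orbits.add(frozenset(tuple(g[p] for p in point) for g in G))

assert len(orbits) == 15                    # Bell(4) orbits in total

# Refine by the number d of distinct entries: Stirling numbers S(4, d).
by_d = Counter(len(set(next(iter(o)))) for o in orbits)
assert by_d == {1: 1, 2: 7, 3: 6, 4: 1}

# Regular orbits have size |G| = 24; here S(4,3) + S(4,4) = 7 of them.
assert sum(1 for o in orbits if len(o) == len(G)) == 7
```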
Since any $G$-set can be written as a disjoint union of homogeneous spaces (20), we expect a decomposition of the product $G$-space in the form
$$[G_i\backslash G] \times [G_j\backslash G] = \bigcup_{[G_\ell] \leq G} \delta^{\ell}_{i,j}\, [G_\ell\backslash G]. \tag{23}$$
Indeed, this decomposition exists, and the multiplicities $\delta^{\ell}_{i,j} \in \mathbb{Z}_{\geq 0}$ are called the structure coefficients of the Burnside ring. The (commutative semi)ring structure is due to the fact that the set of non-isomorphic $G$-sets, $\Omega(G) = \{\bigcup_{[G_i] \leq G} p_i\, [G_i\backslash G] \mid p_i \in \mathbb{Z}_{\geq 0}\}$, is equipped with: 1) a commutative product operation, the Cartesian product of $G$-spaces; and 2) a summation operation, the disjoint union of $G$-spaces (Dieck, 2006). A key to the analysis of product $G$-spaces is finding the structure coefficients in (23).

Example 1 (Product of Sets). The symmetric group $S_N$ acts faithfully on $N$, where the stabilizer of $n \in N$ is $S_n = S_{N - \{n\}}$, the group of all permutations of the remaining items $N - \{n\}$. This means $N \cong [S_{N-\{n\}}\backslash S_N]$.

The diagonal $S_N$-action on the product space $N^D$ decomposes into $\sum_i p_i = \mathrm{Bell}(D)$ orbits, where the Bell number counts the different partitions of a set of $D$ labelled objects (Maron et al., 2018). One may further refine these orbits by their type in the form of (23):
$$[S_{N-\{n\}}\backslash S_N]^D = \bigcup_{d=1}^{D} \mathrm{S}(D, d)\, [S_{N-\{n_1,\ldots,n_d\}}\backslash S_N], \tag{24}$$
where the "structure coefficient" $\mathrm{S}(D, d)$ is the Stirling number of the second kind: it counts the number of ways $D$ objects can be partitioned into $d$ non-empty sets.
For example, when D = 2, one may think of the index set N × N as indexing an |N| × |N| matrix. This matrix decomposes into one (S(2, 1) = 1) diagonal [S_{N−{n_1}}\S_N] and one (S(2, 2) = 1) set of off-diagonals [S_{N−{n_1,n_2}}\S_N]. This decomposition is presented in (Albooyeh et al., 2019), where it is shown that these orbits correspond to "hyper-diagonals" for higher-order tensors. For general groups, inferring the structure coefficients is more challenging, as we will see shortly.

From (24) in the example above, it follows that an order D = |N| product of sets contains a regular orbit. The following corollary combines this with the universality results of Theorems 3.1 and 3.2.

Corollary 4 (Universality of Equivariant Hyper-Graph Networks). An S_N-equivariant network with a hidden layer of order D ≥ |N| is a universal approximator of S_N-equivariant (invariant) functions, where the input and output layers may be of any order.

Note how group-specific analysis gives a better bound of D ≥ N compared to the group-agnostic bound D ≥ N log(N) of Corollary 3. A universality result for the invariant case only, using a quadratic order, appears in (Maron et al., 2019), where the MLP is called a hyper-graph network. Keriven & Peyré (2019) prove universality for the equivariant case, without giving a bound on the order of the hidden layer, and assuming an output of order D = 1. In comparison, Corollary 4 uses a linear bound and applies to the much more general setting of arbitrary orders for the input and output product sets. In fact, the universality result holds for arbitrary input-output S_N-sets.

Linear G-Map as a Product Space. For finite groups, the linear G-map W : R^N → R^M is indexed by M × N, and is therefore itself a product space.
In fact, the parameter-sharing of (3) ties all the parameters W(m, n) that are in the same orbit. Therefore, the decomposition (23) also identifies the parameter-sharing pattern of W.

Example 2 (Equivariant Maps Between Set Products). Equation (24) gives a closed form for the decomposition of N^D into orbits. Assuming a similar decomposition for M^{D′}, the equivariant map W : R^{N^D} → R^{M^{D′}} decomposes into Bell(D + D′) linear maps corresponding to the orbits of M^{D′} × N^D.

Burnside's Table of Marks

Burnside's table of marks simplifies working with the multiplication operation of the Burnside ring, and enables the analysis of the G-action on product spaces (Burnside, 1911; Pfeiffer, 1997). The mark of H ≤ G on a finite G-set N is defined as the number of points in N fixed by all h ∈ H:

m_N(H) := |{n ∈ N | h · n = n ∀h ∈ H}|.   (25)

The interesting quality of the number of fixed points is that it adds up when we take the union of two spaces N_1 ∪ N_2; and, for a product space N_1 × N_2, any combination of points fixed in both spaces will be fixed by H.

(When N and M are homogeneous spaces, another characterization of the orbits of the product space [G_n\G] × [G_m\G] is by their one-to-one correspondence with the double cosets G_n\G/G_m = {G_n g G_m | g ∈ G}.)

Table 1. Table of marks M_G.

            {e}       ...  G_i               ...  G_j               ...  G
[{e}\G]     |G|
...                   ...
[G_i\G]     |G:G_i|   ...  |N_G(G_i):G_i|
...                   ...                   ...
[G_j\G]     |G:G_j|   ...  m_{[G_j\G]}(G_i) ...  |N_G(G_j):G_j|
...                   ...                   ...                   ...
[G\G]       1         ...  1                ...  1                 ...  1
This means

m_{N_1 ∪ N_2}(G_i) = m_{N_1}(G_i) + m_{N_2}(G_i),   (26)
m_{N_1 × N_2}(G_i) = m_{N_1}(G_i) · m_{N_2}(G_i).   (27)

Now define the vector of marks of a G-set N ∈ Ω(G) as

m_N := [m_N(G_1), ..., m_N(G_I)] ∈ Z^I,

where I is the number of conjugacy classes of subgroups of G, and we have assumed a fixed order on the classes [G_i] ≤ G. Due to Eqs. (26) and (27), given G-sets N_1, ..., N_D, we can perform elementwise addition and multiplication on the integer vectors m_{N_1}, ..., m_{N_D} to obtain the marks of their union and product, respectively. Moreover, this special quality of marks makes the map N ↦ m_N an injective homomorphism: we can work backward from the resulting vector of marks and decompose the union/product space into homogeneous spaces. To facilitate the calculation of this vector for any G-set N, one may use the table of marks.

The table of marks of a group G is the square matrix of marks of all subgroups on all right-coset spaces – that is, the (i, j) element of this matrix is

M_G(i, j) := m_{[G_i\G]}(G_j),   or equivalently M_G stacks the rows m_{[{e}\G]}, ..., m_{[G\G]}.   (28)

The matrix M_G carries valuable information about the subgroup structure of G. For example, the G_j-action on [G_i\G] has a fixed point iff [G_j] ≤ [G_i]. Therefore, the sparsity pattern of the table of marks reflects the subgroup lattice structure of G, up to conjugacy.
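To make Eqs. (26)–(28) concrete, here is a small illustration of our own using the well-known table of marks of S₃ (subgroup classes ordered {e}, C₂, C₃, S₃): marks of a union add, marks of a product multiply elementwise, and injectivity lets us decompose a product space by solving a triangular linear system.

```python
import numpy as np

# Table of marks of S3: row i holds the marks of [G_i\G] on all
# subgroup classes, M[i, j] = m_{[G_i\G]}(G_j).
M = np.array([[6, 0, 0, 0],   # [{e}\S3] (regular G-set)
              [3, 1, 0, 0],   # [C2\S3]  (natural action on 3 points)
              [2, 0, 2, 0],   # [C3\S3]
              [1, 1, 1, 1]])  # [S3\S3]  (trivial G-set)

def marks(p):
    # (26): marks of a disjoint union add, so a G-set with orbit
    # multiplicities p has vector of marks p^T M.
    return np.asarray(p) @ M

def decompose(m):
    # Invert the injective homomorphism: recover multiplicities from marks.
    return np.rint(np.linalg.solve(M.T.astype(float), m)).astype(int)

# (27): marks of a product multiply elementwise. Decompose
# [C2\S3] x [C2\S3], i.e. the index set of a 3x3 matrix:
m_prod = marks([0, 1, 0, 0]) * marks([0, 1, 0, 0])
print(decompose(m_prod))  # -> [1 1 0 0]: one regular orbit (off-diagonals)
                          #    plus one copy of [C2\S3] (the diagonal)
```

The result matches the D = 2 case of Example 1: a 3×3 index set splits into its diagonal and off-diagonal orbits.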
A useful property of M_G is that we can use it to find the marks of any G-set N = ∑_i p_i [G_i\G] in Ω(G) via m_N = [p_1, ..., p_I]^⊤ M_G. Moreover, m_{[G_i\G]}(G_j) = m_{[G_i\G]}(g G_j g^{-1}) and m_{[G_i\G]}(G_j) = m_{[g G_i g^{-1}\G]}(G_j) for all g ∈ G; therefore, the characterization by the table of marks is up to conjugacy.

(The subgroup lattice of G is a partially ordered set in which the order G_i < G_j is the subgroup relation, and the greatest and least elements are G and {e}, respectively. Any transitive G-set is isomorphic to a right-coset space produced by a member of this lattice. However, we only care about this lattice up to conjugacy: as we saw, the right-coset spaces H\G and (g^{-1}Hg)\G are isomorphic for all g ∈ G.)

Figure 2. A high-order hidden layer decomposes into orbits, which are characterized by the table of marks. By increasing the order, one can guarantee the existence of a regular orbit in the decomposition; by Theorem 3.2, this leads to universal equivariance.

The structure coefficients of (23) can be recovered from the table of marks:

δ^ℓ_ij = ∑_l M_G(i, l) M_G(j, l) M_G^{-1}(l, ℓ).   (29)

5. Universality of G-Maps on Product Spaces

Using the tools discussed in the previous section, in this section we prove some properties of product spaces that are consequential in the design of equivariant maps. Previously we saw that product spaces decompose into orbits, identified by δ^ℓ_ij > 0 in (23).
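Equation (29) can be implemented in a few lines; a sketch of our own, again using the table of marks of S₃ with classes ordered {e}, C₂, C₃, S₃:

```python
import numpy as np

# Table of marks of S3 (lower triangular under the topological ordering).
M = np.array([[6, 0, 0, 0],
              [3, 1, 0, 0],
              [2, 0, 2, 0],
              [1, 1, 1, 1]], dtype=float)

# (29): delta^l_{ij} = sum_k M(i,k) M(j,k) M^{-1}(k,l),
# i.e. multiply marks elementwise, then change basis back to orbit counts.
delta = np.einsum('ik,jk,kl->ijl', M, M, np.linalg.inv(M))
delta = np.rint(delta).astype(int)

# [C2\S3] x [C2\S3] = [{e}\S3] u [C2\S3]:
assert list(delta[1, 1]) == [1, 1, 0, 0]
# The regular G-set absorbs any factor: [{e}\S3] x [C3\S3] = 2 [{e}\S3].
assert list(delta[0, 2]) == [2, 0, 0, 0]
```

Rounding to integers is safe here because the structure coefficients are integers by construction; the float arithmetic only introduces negligible error for small tables.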
The following theorem states that such product spaces always have orbits that are at least as large as the largest of the input orbits, and that at least one of the product orbits is strictly larger than both inputs. For simplicity, the theorem is stated in terms of stabilizers rather than orbits, where by the orbit-stabilizer theorem larger stabilizers correspond to smaller orbits. Also, while the theorem is stated for the product of homogeneous G-sets, it trivially extends to products of G-sets with multiple orbits.

Theorem 5.1. Let [G_i\G] and [G_j\G] be transitive G-sets, with {e} < G_i, G_j < G. Their product G-set decomposes into orbits [G_i\G] × [G_j\G] = ⋃_ℓ δ^ℓ_ij [G_ℓ\G], such that:
(i) [G_ℓ] ≤ [G_i], [G_j] for all the resulting orbits;
(ii) if G_j ⊄ Core_G(G_i) and G_i ⊄ Core_G(G_j), then [G_ℓ] < [G_i], [G_j] for at least one of the resulting orbits.

Proof. The proof is by analysis of the table of marks M_G. The vector of marks of the product space is the elementwise product of the vectors of marks of the inputs: m_{[G_i\G] × [G_j\G]} = m_{[G_i\G]} ⊙ m_{[G_j\G]}.

Table 2. Table of marks of the alternating group A_5.

            {e}  C_2  C_3  K_4  C_5  S_3  D_5  A_4  A_5
[{e}\A_5]    60    0    0    0    0    0    0    0    0
[C_2\A_5]    30    2    0    0    0    0    0    0    0
[C_3\A_5]    20    0    2    0    0    0    0    0    0
[K_4\A_5]    15    3    0    3    0    0    0    0    0
[C_5\A_5]    12    0    0    0    2    0    0    0    0
[S_3\A_5]    10    2    1    0    0    1    0    0    0
[D_5\A_5]     6    2    0    0    1    0    1    0    0
[A_4\A_5]     5    1    2    1    0    0    0    1    0
[A_5\A_5]     1    1    1    1    1    1    1    1    1
The same vector can be written as a linear combination of rows of M_G with non-negative integer coefficients: m_{[G_i\G]} ⊙ m_{[G_j\G]} = ∑_ℓ δ^ℓ_ij m_{[G_ℓ\G]}. For convenience, we assume a topological ordering of the conjugacy classes of subgroups {e} = G_1, ..., G_i, ..., G_I = G consistent with their partial order – that is, [G_i] ≯ [G_j] for all j > i. This means that M_G is lower triangular, with nonzero diagonal; see Table 1. Three important properties of this table are (Pfeiffer, 1997): (1) the sparsity pattern of M_G reflects the subgroup relation: m_{[G_i\G]}(ℓ) > 0 iff [G_ℓ] ≤ [G_i]; (2) the first column is the index of G_i in G: m_{[G_i\G]}(1) = |G : G_i| for all i; (3) the diagonal element is the index of G_i in its normalizer: m_{[G_i\G]}(i) = |N_G(G_i) : G_i| for all i, where the normalizer of H in G is the largest intermediate subgroup of G in which H is normal: N_G(H) = {g ∈ G | g H g^{-1} = H}.

(i) From (1) it follows that the non-zeros of the product, (m_{[G_i\G]} ⊙ m_{[G_j\G]})(ℓ) > 0, correspond to [G_ℓ] ≤ [G_i] and [G_ℓ] ≤ [G_j]. Since the only rows of M_G with such non-zero elements are m_{[G_ℓ\G]} for such [G_ℓ], all the resulting orbits have such stabilizers.
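The three properties above can be checked numerically on the table of marks of A₅ from Table 2. The subgroup orders and normalizer orders below are standard facts about A₅ that we supply for the check; this verification sketch is ours, not the paper's.

```python
import numpy as np

# Table of marks of A5 (Table 2), subgroup classes in topological order:
# {e}, C2, C3, K4, C5, S3, D5, A4, A5.
M = np.array([
    [60, 0, 0, 0, 0, 0, 0, 0, 0],
    [30, 2, 0, 0, 0, 0, 0, 0, 0],
    [20, 0, 2, 0, 0, 0, 0, 0, 0],
    [15, 3, 0, 3, 0, 0, 0, 0, 0],
    [12, 0, 0, 0, 2, 0, 0, 0, 0],
    [10, 2, 1, 0, 0, 1, 0, 0, 0],
    [ 6, 2, 0, 0, 1, 0, 1, 0, 0],
    [ 5, 1, 2, 1, 0, 0, 0, 1, 0],
    [ 1, 1, 1, 1, 1, 1, 1, 1, 1]])

orders = np.array([1, 2, 3, 4, 5, 6, 10, 12, 60])     # |G_i|
norms  = np.array([60, 4, 6, 12, 10, 6, 10, 12, 60])  # |N_G(G_i)|

# (2) The first column holds the indices |G : G_i|.
assert (M[:, 0] == 60 // orders).all()
# (3) The diagonal holds the indices |N_G(G_i) : G_i|.
assert (np.diag(M) == norms // orders).all()
# The topological ordering makes M lower triangular with nonzero diagonal.
assert (M == np.tril(M)).all() and (np.diag(M) > 0).all()
```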
This finishes the proof of the first claim.

(ii) If [G_i] ≰ [G_j] and [G_j] ≰ [G_i], then [G_ℓ], which is (up to conjugacy) a subgroup of both groups, is strictly smaller than both, which means one of the resulting orbits must be larger than both input orbits. Next, w.l.o.g., assume [G_i] ≤ [G_j]. Consider a proof by contradiction: suppose the product does not have a strictly larger orbit. It then follows that m_{[G_j\G]} ⊙ m_{[G_i\G]} = δ^i_ij m_{[G_i\G]} for some δ^i_ij > 0. Consider the first and the i-th elements of the elementwise product above:

|G : G_j| · |G : G_i| = δ^i_ij |G : G_i|,
m_{[G_j\G]}(i) · |N_G(G_i) : G_i| = δ^i_ij |N_G(G_i) : G_i|.

Substituting δ^i_ij = |G : G_j| from the first equation into the second and simplifying, we get m_{[G_j\G]}(i) = |G : G_j|. This means the action of G_i on [G_j\G] fixes all points, and therefore G_i ⊆ Core_G(G_j) as defined in (21). This contradicts the assumption of (ii). ∎

A sufficient condition for (ii) in Theorem 5.1 is for the G-action on the input G-sets to be faithful; in this case the cores are trivial – see Section 4.1. An implication of this theorem is that the repeated self-product [H\G]^D is bound to produce a regular orbit. This leads to Theorem 3.3, which we saw earlier. Here, we give a shorter proof using Theorem 5.1; see Fig. 2.

Alternative proof of Theorem 3.3. Since G acts faithfully on N, Core_G(H) = {e}. From Theorem 5.1 it follows that each time we take a product with N, a strictly smaller stabilizer is produced, so that H = H^{(0)} > H^{(1)} > ...
> H^{(D)} = {e}, where H^{(d)} is the smallest stabilizer at step d. By Lagrange's theorem, the size of a proper subgroup is at most half the size of its overgroup in this sequence of stabilizers. It follows that for any D ≥ log_2 |H|, [H\G]^D has an orbit with H^{(D)} = {e} as its stabilizer. ∎

Example 3 (Universal Approximation for A_5). The alternating group A_5 is the group of even permutations of 5 objects. One way to create a universal approximator for this group is to have a regular layer (see Theorem 3.2). A more convenient alternative is to consider the canonical action of this group on a set N of size 5, and use an order-D layer to ensure universality. Using Corollary 3 we get D ≥ ⌈log_2(|A_5|)⌉ = ⌈log_2(60)⌉ = 6. The natural action of A_5 on N = [5] is isomorphic to [A_4\A_5] – i.e., A_4 is a stabilizer. Using this stabilizer in Theorem 3.3, we get the bound D ≥ ⌈log_2(|A_4|)⌉ = ⌈log_2(12)⌉ = 4.

However, using the table of marks we can show that D = 3 already produces a regular orbit in this case. The table of marks of the alternating group A_5 is shown in Table 2. Our objective is to find the decomposition of [A_4\A_5]^3. We do this in steps, first showing

[A_4\A_5]^2 = [A_4\A_5] ∪ [C_3\A_5].   (30)

To see this, note that the elementwise product of the vector of marks m_{[A_4\A_5]} (the next-to-last row in Table 2) with itself equals m_{[A_4\A_5]} + m_{[C_3\A_5]}. Since the vector of marks is an injective homomorphism, this implies (30). Applying the same idea one more time gives

[A_4\A_5]^3 = ([A_4\A_5] ∪ [C_3\A_5]) × [A_4\A_5] = [A_4\A_5] ∪ 3[C_3\A_5] ∪ [{e}\A_5].
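Both decompositions in this example can be verified mechanically from Table 2; the following sketch (ours) recovers the orbit multiplicities by solving against the transposed table of marks.

```python
import numpy as np

# Table of marks of A5 (Table 2); subgroup classes ordered
# {e}, C2, C3, K4, C5, S3, D5, A4, A5.
M = np.array([
    [60, 0, 0, 0, 0, 0, 0, 0, 0],
    [30, 2, 0, 0, 0, 0, 0, 0, 0],
    [20, 0, 2, 0, 0, 0, 0, 0, 0],
    [15, 3, 0, 3, 0, 0, 0, 0, 0],
    [12, 0, 0, 0, 2, 0, 0, 0, 0],
    [10, 2, 1, 0, 0, 1, 0, 0, 0],
    [ 6, 2, 0, 0, 1, 0, 1, 0, 0],
    [ 5, 1, 2, 1, 0, 0, 0, 1, 0],
    [ 1, 1, 1, 1, 1, 1, 1, 1, 1]])

def decompose(m):
    # Invert the (injective) vector of marks: solve p^T M = m for the
    # orbit multiplicities p.
    return np.rint(np.linalg.solve(M.T.astype(float), m)).astype(int)

m_A4 = M[7]  # marks of [A4\A5], the natural action on 5 points

# Eq. (30): [A4\A5]^2 = [A4\A5] u [C3\A5].
assert list(decompose(m_A4 * m_A4)) == [0, 0, 1, 0, 0, 0, 0, 1, 0]

# [A4\A5]^3 = [A4\A5] u 3[C3\A5] u [{e}\A5]: a regular orbit appears
# already at order D = 3, inside the general guarantee D >= ceil(log2|A4|) = 4.
assert list(decompose(m_A4 ** 3)) == [1, 0, 3, 0, 0, 0, 0, 1, 0]
```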
This shows that [A_4\A_5]^3 contains a regular orbit [{e}\A_5]. Therefore, using an order D = 3 hidden layer N^3, on which A_5 acts through even permutations, also produces a universal equivariant (invariant) approximator.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. In particular, the first proof of Theorem 3.3, as well as clarifications on the proofs of the main theorems, was proposed by reviewers. This research is in part funded by the Canada CIFAR AI Chair Program.

References

Albooyeh, M., Bertolini, D., and Ravanbakhsh, S. Incidence networks for geometric deep learning. arXiv preprint arXiv:1905.11460, 2019.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Burnside, W. Theory of Groups of Finite Order. Cambridge University Press, 1911.

Cohen, T. S. and Welling, M. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Cohen, T. S., Geiger, M., and Weiler, M. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142–9153, 2019a.

Cohen, T. S., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019b.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Dieck, T. T. Transformation Groups and Representation Theory, volume 766. Springer, 2006.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.

Funahashi, K.-I.
On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

Gens, R. and Domingos, P. M. Deep symmetry networks. In Advances in Neural Information Processing Systems, pp. 2537–2545, 2014.

Graham, D. and Ravanbakhsh, S. Deep models for relational databases. arXiv preprint arXiv:1903.09033, 2019.

Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. In Proceedings of the 35th International Conference on Machine Learning, pp. 1909–1918, 2018.

Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pp. 44–51. Springer, 2011.

Hinton, G. E., Sabour, S., and Frosst, N. Matrix capsules with EM routing. 2018.

Hornik, K., Stinchcombe, M., White, H., et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.

Keriven, N. and Peyré, G. Universal invariant and equivariant graph neural networks. In Advances in Neural Information Processing Systems, pp. 7090–7099, 2019.

Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Lenssen, J. E., Fey, M., and Libuschewski, P. Group equivariant capsule networks. arXiv preprint arXiv:1806.05086, 2018.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y.
On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.

Minsky, M. and Papert, S. A. Perceptrons: An Introduction to Computational Geometry. MIT Press, 2017.

Pfeiffer, G. The subgroups of M24, or how to compute the table of marks of a finite group. Experimental Mathematics, 6(3):247–270, 1997.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660, 2017.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Deep learning with sets and point clouds. In International Conference on Learning Representations (ICLR) – workshop track, 2017a.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of JMLR: WCP, August 2017b.

Rotman, J. J. An Introduction to the Theory of Groups, volume 148. Springer Science & Business Media, 2012.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.

Sannai, A., Takai, Y., and Cordonnier, M. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.

Segol, N. and Lipman, Y. On universal equivariant set networks. arXiv preprint arXiv:1910.02421, 2019.

Shawe-Taylor, J. Building symmetries into feedforward networks. In Artificial Neural Networks, 1989., First IEE International Conference on (Conf. Publ. No. 313), pp. 158–162. IET, 1989.

Shawe-Taylor, J. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816–826, 1993.

Sturmfels, B. Algorithms in Invariant Theory. Springer Science & Business Media, 2008.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs.
In Advances in Neural Information Processing Systems, pp. 14334–14345, 2019.

Wood, J. Invariant pattern recognition: a review. Pattern Recognition, 29(1):1–17, 1996.

Wood, J. and Shawe-Taylor, J. Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1-2):33–60, 1996.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.

Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In