Efficient Generalized Spherical CNNs
Oliver J. Cobb, Christopher G. R. Wallis, Augustine N. Mavor-Parker, Augustin Marignier, Matthew A. Price, Mayeul d'Avezac, Jason D. McEwen
Published as a conference paper at ICLR 2021

Kagenova Limited, Spaces, Austen House, Guildford GU1 4AR, UK

ABSTRACT
Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity O(C²L⁵), where C is a measure of representational capacity and L the spherical harmonic bandlimit. Such a high computational cost often prohibits the use of strictly equivariant spherical CNNs. We develop two new strictly equivariant layers with reduced complexity O(CL⁴) and O(CL³ log L), making larger, more expressive models computationally feasible. Moreover, we adopt efficient sampling theory to achieve further computational savings. We show that these developments allow the construction of more expressive hybrid models that achieve state-of-the-art accuracy and parameter efficiency on spherical benchmark problems.

1 INTRODUCTION
Many fields involve data that live inherently on spherical manifolds, e.g. 360° photo and video content in virtual reality and computer vision, the cosmic microwave background radiation from the Big Bang in cosmology, topographic and gravitational maps in planetary sciences, and molecular shape orientations in molecular chemistry, to name just a few. Convolutional neural networks (CNNs) have been tremendously effective for data defined on Euclidean domains, such as the 1D line, 2D plane, or nD volumes, thanks in part to their translation invariance properties. However, these techniques are not effective for data defined on spherical manifolds, which have a very different geometric structure to Euclidean spaces (see Appendix A). To transfer the remarkable success of deep learning to data defined on spherical domains, deep learning techniques defined inherently on the sphere are required. Recently, a number of spherical CNN constructions have been proposed.

Existing CNN constructions on the sphere fall broadly into three categories: fully real (i.e. pixel) space approaches (e.g. Boomsma & Frellsen, 2017; Jiang et al., 2019; Perraudin et al., 2019; Cohen et al., 2019); combined real and harmonic space approaches (Cohen et al., 2018; Esteves et al., 2018; 2020); and fully harmonic space approaches (Kondor et al., 2018). Real space approaches can often be computed efficiently but they necessarily provide an approximate representation of spherical signals and the connection to the underlying continuous symmetries of the sphere is lost. Consequently, such approaches cannot fully capture rotational equivariance. Other constructions take a combined real and harmonic space approach (Cohen et al., 2018; Esteves et al., 2018; 2020), where sampling theorems (Driscoll & Healy, 1994; Kostelec & Rockmore, 2008) are exploited to connect with underlying continuous signal representations to capture the continuous symmetries of the sphere.
However, in these approaches non-linear activation functions are computed pointwise in real space, which induces aliasing errors that break strict rotational equivariance. Fully harmonic space spherical CNNs have been constructed by Kondor et al. (2018). A continual connection with underlying continuous signal representations is captured by using harmonic signal representations throughout. Consequently, this is the only approach exhibiting strict rotational equivariance. However, strict equivariance comes at great computational cost, which can often prohibit usage.

In this article we present a generalized framework for CNNs on the sphere (and rotation group), which encompasses and builds on the influential approaches of Cohen et al. (2018), Esteves et al. (2018) and Kondor et al. (2018) and allows them to be leveraged alongside each other. We adopt a harmonic signal representation in order to retain the connection with underlying continuous representations and thus capture all symmetries and geometric properties of the sphere. We construct new fully harmonic (non-linear) spherical layers that are strictly rotationally equivariant, are parameter-efficient, and dramatically reduce computational cost compared to similar approaches. This is achieved by a channel-wise structure, constrained generalized convolutions, and an optimized degree mixing set determined by a minimum spanning tree. Furthermore, we adopt efficient sampling theorems on the sphere (McEwen & Wiaux, 2011) and rotation group (McEwen et al., 2015a) to improve efficiency compared to the sampling theorems used in existing approaches (Driscoll & Healy, 1994; Kostelec & Rockmore, 2008). We demonstrate state-of-the-art performance on all spherical benchmark problems considered, both in terms of accuracy and parameter efficiency.

2 GENERALIZED SPHERICAL CNNS

We first discuss the theoretical underpinnings of the spherical CNN frameworks introduced by Cohen et al. (2018), Esteves et al. (2018), and Kondor et al. (2018), which make a connection to underlying continuous signals through harmonic representations. We then present a generalized spherical layer in which these and other existing frameworks are encompassed, allowing existing frameworks to be easily integrated and leveraged alongside each other in hybrid networks.

Throughout the following we consider a network composed of S rotationally equivariant layers A^(1), ..., A^(S), each mapping an input activation f^(i-1) ∈ H^(i-1) onto an output activation f^(i) ∈ H^(i). We focus on the case where the network input space H^(0) consists of spherical signals.
2.1 SIGNALS ON THE SPHERE AND ROTATION GROUP

A signal f ∈ L²(Ω) on the sphere (Ω = S²) or rotation group (Ω = SO(3)) can be rotated by ρ ∈ SO(3) by defining the action of rotation on signals by (R_ρ f)(ω) = f(ρ⁻¹ω) for ω ∈ Ω. An operator A : L²(Ω₁) → L²(Ω₂), where Ω₁, Ω₂ ∈ {S², SO(3)}, is then equivariant to rotations if R_ρ(A(f)) = A(R_ρ f) for all f ∈ L²(Ω₁) and ρ ∈ SO(3), i.e. rotating the function before application of the operator is equivalent to application of the operator first, followed by a rotation.

A spherical signal f ∈ L²(S²) admits a harmonic representation (f̂_0, f̂_1, ...), where f̂_ℓ ∈ C^(2ℓ+1) are the Fourier coefficients given by the inner products ⟨f, Y_ℓm⟩, where Y_ℓm are the spherical harmonic functions of natural degree ℓ and integer order |m| ≤ ℓ. Likewise, a signal f ∈ L²(SO(3)) on the rotation group admits a harmonic representation (f̂_0, f̂_1, ...), where f̂_ℓ ∈ C^((2ℓ+1)×(2ℓ+1)) are the Fourier coefficients with (m, n)-th entry ⟨f, D^ℓ_mn⟩ for integers |m|, |n| ≤ ℓ, where D^ℓ : SO(3) → C^((2ℓ+1)×(2ℓ+1)) is the unique (2ℓ+1)-dimensional irreducible group representation of SO(3) on C^(2ℓ+1). The rotation f ↦ R_ρ f of a signal f ∈ L²(Ω) can be described in harmonic space by f̂_ℓ ↦ D^ℓ(ρ) f̂_ℓ. Real-world signals can be accurately represented by bandlimited signals, where f̂_ℓ = 0 for ℓ ≥ L, for suitable bandlimit L; henceforth, we consider bandlimited signals. Additionally, a signal on SO(3) is azimuthally bandlimited at N if ⟨f, D^ℓ_mn⟩ = 0 for |n| ≥ N.
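As a concrete illustration of the harmonic representation above, the sketch below estimates the coefficients f̂_ℓm = ⟨f, Y_ℓm⟩ of a bandlimited spherical signal by quadrature. This is an illustrative sketch only (the helper names and grid sizes are ours); it uses SciPy's spherical harmonics rather than the fast, sampling-theorem-based transforms discussed in this paper.

```python
import numpy as np

try:                                     # SciPy < 1.17
    from scipy.special import sph_harm
    def Ylm(m, ell, polar, azimuth):
        return sph_harm(m, ell, azimuth, polar)
except ImportError:                      # SciPy >= 1.15 replacement
    from scipy.special import sph_harm_y
    def Ylm(m, ell, polar, azimuth):
        return sph_harm_y(ell, m, polar, azimuth)

def harmonic_coefficients(f, L, n_theta=64, n_phi=128):
    """Estimate f_hat_{lm} = <f, Y_lm> for all degrees ell < L by quadrature.

    Gauss-Legendre nodes in cos(theta) make the quadrature exact for
    signals bandlimited well below n_theta.
    """
    x, w = np.polynomial.legendre.leggauss(n_theta)  # nodes in cos(theta)
    theta = np.arccos(x)                             # polar angle
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    Theta, Phi = np.meshgrid(theta, phi, indexing="ij")
    F = f(Theta, Phi)
    coeffs = []
    for ell in range(L):
        c = np.zeros(2 * ell + 1, dtype=complex)
        for i, m in enumerate(range(-ell, ell + 1)):
            Y = Ylm(m, ell, Theta, Phi)
            c[i] = (2.0 * np.pi / n_phi) * np.sum(w[:, None] * F * np.conj(Y))
        coeffs.append(c)                 # coefficient vector f_hat_ell in C^{2*ell+1}
    return coeffs
```

For example, feeding f = Y_{1,0} recovers a coefficient of 1 at (ℓ = 1, m = 0) and zeros elsewhere, matching the orthonormality of the spherical harmonics.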
2.2 CONVOLUTION ON THE SPHERE AND ROTATION GROUP

A standard definition of the convolution of two signals f, ψ ∈ L²(Ω) is given by

    (f ⋆ ψ)(ρ) = ⟨f, R_ρ ψ⟩ = ∫_Ω dμ(ω) f(ω) ψ*(ρ⁻¹ω),    (1)

where dμ(ω) denotes the Haar measure on Ω and ·* complex conjugation (e.g. Wandelt & Górski, 2001; McEwen et al., 2007; 2013; 2015b; 2018; Cohen et al., 2018; Esteves et al., 2018). In particular, the convolution satisfies ((R_ρ' f) ⋆ ψ)(ρ) = ⟨R_ρ' f, R_ρ ψ⟩ = ⟨f, R_{ρ'⁻¹ρ} ψ⟩ = (R_ρ'(f ⋆ ψ))(ρ) and is therefore a rotationally equivariant linear operation, which we shall denote by L^(ψ).

The convolution of bandlimited signals can be computed exactly and efficiently in Fourier space by

    (f ⋆ ψ)^_ℓ = f̂_ℓ ψ̂_ℓ^†,    ℓ = 0, ..., L−1,    (2)

where ·^† denotes conjugate transposition, which for each degree ℓ is a vector outer product for signals on the sphere and a matrix product for signals on the rotation group (see Appendix B for further details). Convolving in this manner results in signals on the rotation group. However, if the spherical filter is invariant to azimuthal rotations the resultant convolved signal may be interpreted as a signal on the sphere (see Appendix B).
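Equation 2 is a one-liner per degree. The following sketch (names are ours, and random unitary matrices stand in for the Wigner matrices D^ℓ(ρ)) implements the harmonic-space convolution of two spherical signals and numerically confirms its rotational equivariance.

```python
import numpy as np

def spherical_convolution(f_hat, psi_hat):
    """Equation 2 for spherical signals: (f * psi)_ell = f_hat_ell psi_hat_ell^dagger.

    f_hat[ell], psi_hat[ell] in C^{2*ell+1}; each degree yields a
    (2*ell+1) x (2*ell+1) outer-product matrix, i.e. a Fourier
    coefficient of a signal on the rotation group SO(3).
    """
    return [np.outer(f_l, np.conj(p_l)) for f_l, p_l in zip(f_hat, psi_hat)]

rng = np.random.default_rng(0)
L = 4
crand = lambda *shape: rng.normal(size=shape) + 1j * rng.normal(size=shape)
f_hat = [crand(2 * l + 1) for l in range(L)]
psi_hat = [crand(2 * l + 1) for l in range(L)]

# Random unitaries as stand-ins for D^ell(rho): rotation acts on harmonic
# coefficients as f_hat_ell -> D^ell f_hat_ell.
D = [np.linalg.qr(crand(2 * l + 1, 2 * l + 1))[0] for l in range(L)]
lhs = spherical_convolution([D[l] @ f_hat[l] for l in range(L)], psi_hat)
rhs = [D[l] @ g_l for l, g_l in enumerate(spherical_convolution(f_hat, psi_hat))]
assert all(np.allclose(a, b) for a, b in zip(lhs, rhs))  # equivariance holds exactly
```

The check mirrors the derivation above: rotating the input multiplies each degree block on the left by D^ℓ(ρ), which passes straight through the outer product.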
2.3 GENERALIZED SIGNAL REPRESENTATIONS

The harmonic representations and convolutions described above have proven useful for describing rotationally equivariant linear operators L^(ψ). Cohen et al. (2018) and Esteves et al. (2018) define spherical CNNs that sequentially apply this operator, with intermediary representations on SO(3) and S² respectively. A more general space for intermediary representations is introduced by Kondor et al. (2018), to which the aforementioned notions of rotation and convolution naturally extend.

All bandlimited signals on the sphere and rotation group can be represented as a set of variable length vectors of the form f = {f̂_ℓt ∈ C^(2ℓ+1) : ℓ = 0, ..., L−1; t = 1, ..., τ_f^ℓ}, where τ_f^ℓ = 1 for signals on the sphere and τ_f^ℓ = min(2ℓ+1, 2N−1) for signals on the rotation group. Let F_L be the space of all such sets of variable length vectors, which clearly includes the spaces of bandlimited signals on the sphere and rotation group as strict subspaces. For a signal f ∈ F_L we adopt the terminology of Kondor et al. (2018) by referring to f̂_ℓt as the t-th fragment of degree ℓ and to τ_f = (τ_f^0, ..., τ_f^(L−1)), specifying the number of fragments for each degree, as the type of f. The action of rotations upon F_L can be naturally extended from their action upon L²(S²) and L²(SO(3)). For f ∈ F_L we define the rotation operator f ↦ R_ρ f by f̂_ℓt ↦ D^ℓ(ρ) f̂_ℓt, allowing us to extend the usual notion of equivariance to operators A : F_L → F_L.
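To make the generalized representation concrete, a minimal sketch (helper names are ours) of types and elements of F_L:

```python
import numpy as np

def sphere_type(L):
    """Type of a bandlimited spherical signal: a single fragment per degree."""
    return [1] * L

def rotation_group_type(L, N):
    """Type of a signal on SO(3) with bandlimit L and azimuthal bandlimit N."""
    return [min(2 * ell + 1, 2 * N - 1) for ell in range(L)]

def random_signal(rng, tau):
    """A random element of F_L of type tau: tau[ell] fragments in C^{2*ell+1}."""
    return [[rng.normal(size=2 * ell + 1) + 1j * rng.normal(size=2 * ell + 1)
             for _ in range(tau[ell])] for ell in range(len(tau))]
```

For instance, `rotation_group_type(4, 2)` gives the type (1, 3, 3, 3): the azimuthal bandlimit caps the fragment count at 2N − 1 = 3 once 2ℓ + 1 exceeds it.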
2.4 GENERALIZED CONVOLUTIONS

The convolution described by Equation 1 provides a learnable linear operator L^(ψ) that satisfies the desired property of equivariance. Nevertheless, given the generalized interpretation of signals on S² and SO(3) as signals in F_L, the notion of convolution can also be generalized (Kondor et al., 2018). In order to linearly and equivariantly transform a signal f ∈ F_L of type τ_f into a new signal f ⋆ ψ ∈ F_L of any desired type τ_{f⋆ψ}, we may specify a filter ψ = {ψ̂_ℓ ∈ C^(τ_f^ℓ × τ_{f⋆ψ}^ℓ) : ℓ = 0, ..., L−1}, which in general is not an element of F_L, and define a transformation f ↦ f ⋆ ψ by

    (f ⋆ ψ)_ℓt = Σ_{t'=1}^{τ_f^ℓ} f̂_ℓt' ψ̂_ℓ;t',t,    ℓ = 0, ..., L−1;  t = 1, ..., τ_{f⋆ψ}^ℓ.    (3)

The degree-ℓ fragments of the transformed signal f ⋆ ψ are simply linear combinations of the degree-ℓ fragments of f, with no mixing between degrees (all equivariant linear operations take this form; Kondor & Trivedi 2018). The harmonic representation of the standard convolution (Equation 2) takes precisely this form. The generalized convolution does not force the filter ψ to occupy the same domain as the signal f and thus allows control over the type τ_{f⋆ψ} of the transformed signal. We use L_G^(ψ) to denote this generalized convolutional operator.
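A sketch of the generalized convolution of Equation 3, stacking the fragments of each degree as columns of a matrix so that the sum over t' becomes a matrix product (names are ours; random unitaries again stand in for D^ℓ(ρ)):

```python
import numpy as np

def generalized_convolution(f, psi):
    """Equation 3: per-degree recombination of fragments, no mixing of degrees.

    f[ell]:   (2*ell+1, tau_in[ell]) array, fragments stacked as columns.
    psi[ell]: (tau_in[ell], tau_out[ell]) filter matrix.
    """
    return [f_l @ psi_l for f_l, psi_l in zip(f, psi)]

rng = np.random.default_rng(1)
L, tau_in, tau_out = 3, [2, 3, 1], [4, 1, 2]
crand = lambda *s: rng.normal(size=s) + 1j * rng.normal(size=s)
f = [crand(2 * l + 1, tau_in[l]) for l in range(L)]
psi = [crand(tau_in[l], tau_out[l]) for l in range(L)]
g = generalized_convolution(f, psi)
assert [x.shape[1] for x in g] == tau_out      # output has the desired type

# Equivariance: rotation acts on the left of each degree block and so
# commutes with right-multiplication by the filter.
D = [np.linalg.qr(crand(2 * l + 1, 2 * l + 1))[0] for l in range(L)]
lhs = generalized_convolution([D[l] @ f[l] for l in range(L)], psi)
rhs = [D[l] @ g[l] for l in range(L)]
assert all(np.allclose(a, b) for a, b in zip(lhs, rhs))
```

Note how the filter freely changes the type (here from (2, 3, 1) to (4, 1, 2)) while leaving each degree's transformation behaviour untouched.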
2.5 NON-LINEAR ACTIVATION OPERATORS

For F_L to be a useful representational space, it must be possible to not only linearly but also non-linearly transform its elements in an equivariant manner. However, equivariance and non-linearity are not enough. Equivariant linear operators cannot mix information corresponding to different degrees. Therefore it is of crucial importance that degree mixing is achieved by the non-linear operator.
2.5.1 POINTWISE ACTIVATIONS

When the type τ_f of f ∈ F_L permits an interpretation as a signal on S² or SO(3) we may perform an inverse Fourier transform to map the function onto a discretized real-space representation (e.g. Driscoll & Healy, 1994; McEwen & Wiaux, 2011; Kostelec & Rockmore, 2008; McEwen et al., 2015a). A non-linear function σ : C → C may then be applied pointwise, i.e. separately to each sample, before performing a Fourier transform to return to a representation in F_L. We denote the corresponding non-linear operator as N_σ(f) = F(σ(F⁻¹(f))), where F represents the Fourier transform on S² or SO(3). The computational cost of the non-linear operator is dominated by the Fourier transforms. While costly, fast algorithms can be leveraged (see Appendix A). While inverse and forward Fourier transforms on S² or SO(3) that are based on a sampling theory maintain perfect equivariance for bandlimited signals, the pointwise application of σ (most commonly ReLU) is only equivariant in the continuous limit L → ∞. For any finite bandlimit L, aliasing effects are introduced such that equivariance becomes approximate only, as shown by the following experiments.

We consider 100 random rotations ρ ∈ SO(3), for each of 100 random signal-filter pairs (f, ψ), and compute the mean equivariance error d(A(R_ρ f), R_ρ(A f)) for operator A, where d(f, g) = ∥f − g∥/∥f∥ is the relative distance between signals. For convolutions the equivariance error is at the level of floating point precision, for signals on both S² and SO(3). By comparison, the equivariance error for a pointwise ReLU is many orders of magnitude larger, for signals on both S² and SO(3). Only approximate equivariance is achieved for the ReLU since the non-linear operation spreads information to higher degrees that are not captured at the original bandlimit, resulting in aliasing. To demonstrate this point we reduce aliasing error by oversampling the real-space signal. When oversampling signals on SO(3) the equivariance error of the ReLU is progressively reduced as the oversampling factor is increased. See Appendix D for further experimental details. Despite the high cost of repeated Fourier transforms and imperfect equivariance, this is nevertheless the approach adopted by Cohen et al. (2018), Esteves et al. (2018) and others, who find empirically that such models maintain a reasonable degree of equivariance.
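The aliasing mechanism is easy to reproduce in a one-dimensional analogue on the circle, where rotation is an exact phase shift in Fourier space. The toy experiment below is ours (it is not the S²/SO(3) experiment of Appendix D), but it exhibits the same three effects: exact equivariance of harmonic convolution, broken equivariance of the pointwise ReLU, and the reduction of that error under oversampling.

```python
import numpy as np

K = 8                      # bandlimit: keep harmonics |k| <= K
N0 = 2 * K + 2             # critically sampled base grid
ALPHA = 0.7                # rotation angle, deliberately off the sample grid
rng = np.random.default_rng(2)

def rotate(c, alpha=ALPHA):
    """Exact rotation of a bandlimited signal, applied in Fourier space."""
    return c * np.exp(-1j * np.arange(len(c)) * alpha)

def relu_layer(c, oversample=1):
    """Inverse FFT -> pointwise ReLU -> FFT -> truncate back to the band."""
    x = np.fft.irfft(c, n=oversample * N0)
    y_hat = np.fft.rfft(np.maximum(x, 0.0))
    out = np.zeros_like(c)
    out[:K + 1] = y_hat[:K + 1]
    return out

def equivariance_error(op, c):
    a, b = op(rotate(c)), rotate(op(c))
    return np.linalg.norm(a - b) / np.linalg.norm(b)

c = np.zeros(N0 // 2 + 1, dtype=complex)
c[1:K + 1] = rng.normal(size=K) + 1j * rng.normal(size=K)

h = np.zeros_like(c)
h[:K + 1] = rng.normal(size=K + 1)                   # a bandlimited filter
err_conv = equivariance_error(lambda u: u * h, c)    # harmonic convolution
err_relu = equivariance_error(relu_layer, c)
err_relu_4x = equivariance_error(lambda u: relu_layer(u, oversample=4), c)
# err_conv sits at floating point precision; err_relu is orders of magnitude
# larger; oversampling before the ReLU reduces (but does not remove) the error.
```

The ReLU spreads energy beyond the bandlimit, and on the critically sampled grid that energy aliases back into the retained band; a finer grid pushes the aliases to much higher frequencies, which is exactly the oversampling effect described above.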
2.5.2 TENSOR-PRODUCT ACTIVATIONS

In order to define a strictly equivariant non-linear operation that can be applied to a signal f ∈ F_L of any type τ_f we leverage the decomposability of tensor products between group representations, as first considered by Thomas et al. (2018) in the context of neural networks.

Given two group representations D^ℓ₁ and D^ℓ₂ of SO(3) on C^(2ℓ₁+1) and C^(2ℓ₂+1) respectively, the tensor-product group representation D^ℓ₁ ⊗ D^ℓ₂ of SO(3) on C^(2ℓ₁+1) ⊗ C^(2ℓ₂+1) is defined such that (D^ℓ₁ ⊗ D^ℓ₂)(ρ) = D^ℓ₁(ρ) ⊗ D^ℓ₂(ρ) for all ρ ∈ SO(3). Decomposing D^ℓ₁ ⊗ D^ℓ₂ into a direct sum of irreducible group representations then constitutes finding a change of basis for C^(2ℓ₁+1) ⊗ C^(2ℓ₂+1) such that (D^ℓ₁ ⊗ D^ℓ₂)(ρ) is block diagonal, where for each ℓ there is a block equal to D^ℓ(ρ). The necessary change of basis for û_ℓ₁ ⊗ v̂_ℓ₂ ∈ C^(2ℓ₁+1) ⊗ C^(2ℓ₂+1) is given by

    (û_ℓ₁ ⊗ v̂_ℓ₂)_ℓm = Σ_{m₁=−ℓ₁}^{ℓ₁} Σ_{m₂=−ℓ₂}^{ℓ₂} C^{ℓ₁,ℓ₂,ℓ}_{m₁,m₂,m} û_{ℓ₁m₁} v̂_{ℓ₂m₂},    (4)

where C^{ℓ₁,ℓ₂,ℓ}_{m₁,m₂,m} ∈ C denote Clebsch-Gordan coefficients whose symmetry properties are such that (û_ℓ₁ ⊗ v̂_ℓ₂)_ℓm is non-zero only for |ℓ₁ − ℓ₂| ≤ ℓ ≤ ℓ₁ + ℓ₂. The use of Equation 4 arises naturally in quantum mechanics when coupling angular momenta.

This property is useful since if f̂_ℓ₁ ∈ C^(2ℓ₁+1) and f̂_ℓ₂ ∈ C^(2ℓ₂+1) are two fragments that are equivariant with respect to (w.r.t.) rotations of the network input, then a rotation of ρ applied to the network input results in f̂_ℓ₁ ⊗ f̂_ℓ₂ transforming as [f̂_ℓ₁ ⊗ f̂_ℓ₂]_ℓ ↦ [(D^ℓ₁(ρ) f̂_ℓ₁) ⊗ (D^ℓ₂(ρ) f̂_ℓ₂)]_ℓ = [(D^ℓ₁ ⊗ D^ℓ₂)(ρ)(f̂_ℓ₁ ⊗ f̂_ℓ₂)]_ℓ = D^ℓ(ρ)[f̂_ℓ₁ ⊗ f̂_ℓ₂]_ℓ, where the final equality follows by block diagonality with respect to the chosen basis. Therefore, if fragments f̂_ℓ₁ and f̂_ℓ₂ are equivariant w.r.t. rotations of the network input, then so is the fragment (C^{ℓ₁,ℓ₂,ℓ})^⊤ (f̂_ℓ₁ ⊗ f̂_ℓ₂) ∈ C^(2ℓ+1), where we have written Equation 4 more compactly. We now describe how Kondor et al. (2018) use this fact to define equivariant non-linear transformations of elements in F_L.

A signal f = {f̂_ℓt ∈ C^(2ℓ+1) : ℓ = 0, ..., L−1; t = 1, ..., τ_f^ℓ} ∈ F_L may be equivariantly and non-linearly transformed by an operator N_⊗ : F_L → F_L defined as

    N_⊗(f) = {(C^{ℓ₁,ℓ₂,ℓ})^⊤ (f̂_{ℓ₁t₁} ⊗ f̂_{ℓ₂t₂}) : ℓ = 0, ..., L−1; (ℓ₁, ℓ₂) ∈ P_L^ℓ; t₁ = 1, ..., τ_f^ℓ₁; t₂ = 1, ..., τ_f^ℓ₂},    (5)

where for each degree ℓ ∈ {0, ..., L−1} the set

    P_L^ℓ = {(ℓ₁, ℓ₂) ∈ {0, ..., L−1}² : |ℓ₁ − ℓ₂| ≤ ℓ ≤ ℓ₁ + ℓ₂}    (6)

is defined in order to avoid the computation of trivially equivariant all-zero fragments. We make the dependence on P_L^ℓ explicit since we redefine it in Section 3. Unlike the pointwise activations discussed in the previous section this operator is strictly equivariant, obtaining a mean relative equivariance error at floating point precision (see Appendix D).
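The Clebsch-Gordan coefficients of Equation 4 are available in standard computer algebra systems. The sketch below (the helper name is ours) checks the selection rules that give rise to the constraint in Equation 6 and to the sparsity exploited later in Section 3.

```python
from sympy import S
from sympy.physics.quantum.cg import CG

def cg(l1, m1, l2, m2, l, m):
    """Clebsch-Gordan coefficient <l1 m1; l2 m2 | l m> as a float."""
    return float(CG(S(l1), S(m1), S(l2), S(m2), S(l), S(m)).doit())

# Sparsity: (u x v)_{l m} only receives contributions with m1 + m2 = m.
assert cg(1, 1, 1, 0, 0, 0) == 0.0          # m1 + m2 != m
assert cg(1, 1, 1, -1, 0, 0) != 0.0

# Fragments outside |l1 - l2| <= l <= l1 + l2 are identically zero,
# which is exactly why Equation 6 excludes such degree pairs.
assert cg(1, 0, 1, 0, 3, 0) == 0.0

# A known value from angular momentum coupling: <1 0; 1 0 | 0 0> = -1/sqrt(3).
assert abs(cg(1, 0, 1, 0, 0, 0) + 1.0 / 3 ** 0.5) < 1e-12
```

The ℓ = 0 coupling of a degree-1 fragment with itself is (up to this constant) the rotation-invariant pairing of the two vectors, the simplest example of an invariant fragment.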
Note however that g = N_⊗(f) has type τ_g = (τ_g^0, ..., τ_g^(L−1)) where τ_g^ℓ = Σ_{(ℓ₁,ℓ₂)∈P_L^ℓ} τ_f^ℓ₁ τ_f^ℓ₂, and therefore application of this non-linear operator results in a drastic expansion in representation size, which is problematic.
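The expansion can be quantified directly from Equation 6; a short sketch (function names are ours):

```python
def mixing_set(ell, L):
    """P_L^ell (Equation 6): the degree pairs that couple into degree ell."""
    return [(l1, l2) for l1 in range(L) for l2 in range(L)
            if abs(l1 - l2) <= ell <= l1 + l2]

def expanded_type(tau):
    """Type of N_tensor(f): one output fragment per (l1, l2, t1, t2)."""
    L = len(tau)
    return [sum(tau[l1] * tau[l2] for l1, l2 in mixing_set(ell, L))
            for ell in range(L)]
```

Even a tiny spherical input of type (1, 1, 1, 1) (so L = 4) expands to type (4, 9, 11, 10): a single application of N_⊗ multiplies the number of fragments more than eightfold, and the growth worsens rapidly with L.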
2.6 GENERALIZED SPHERICAL CNNS

Equipped with operators to both linearly and non-linearly transform elements of F_L, with the latter also performing degree mixing, we may consider a network with representation spaces H^(0) = ... = H^(S) = F_L. We consider the s-th layer of the network to take the form of a triple A^(s) = (L₁, N, L₂) such that A^(s)(f^(s−1)) = L₂(N(L₁(f^(s−1)))), where L₁, L₂ : F_L → F_L are linear operators and N : F_L → F_L is a non-linear activation operator. The approaches of Cohen et al. (2018) and Esteves et al. (2018) are encompassed in this framework as A^(s) = (L^(ψ), N_σ, I), where I denotes the identity operator and ψ may be defined to encode real-space properties such as localization (see Appendix C). The framework of Kondor et al. (2018) is also encompassed as A^(s) = (I, N_⊗, L_G^(ψ)). The generalized convolution comes last in this case to counteract the representation-expanding effect of the tensor-product activation and prevent it from compounding as signals pass through the network. For any intermediary representation f^(i) ∈ F_L we may transition from equivariance with respect to the network input to invariance by discarding all but the scalar-valued fragments corresponding to ℓ = 0 (equivalent to average pooling for signals on the sphere and rotation group). Finally, note that A^(s) need not take the same form for all s; within our general framework we are free to consider hybrid approaches.
3 EFFICIENT GENERALIZED SPHERICAL CNNS

Existing approaches to spherical convolutional layers that are encompassed within the above framework are computationally demanding. They require the evaluation of costly Fourier transforms on the sphere and rotation group. Furthermore, the only strictly rotationally equivariant non-linear layer is that of Kondor et al. (2018), which has an even greater computational cost that scales with the fifth power of the bandlimit, thereby limiting spatial resolution, and quadratically with the number of fragments per degree, thereby limiting representational capacity. This often prohibits the use of strictly equivariant spherical networks. In this section we introduce a channel-wise structure, constrained generalized convolutions, and an optimized degree mixing set in order to construct new strictly equivariant layers that exhibit much improved scaling properties and parameter efficiency. Furthermore, we adopt efficient sampling theory on the sphere and rotation group to achieve additional computational savings.

3.1 EFFICIENT GENERALIZED SPHERICAL LAYERS
For an activation f ∈ F_L the value τ̄_f = (1/L) Σ_{ℓ=0}^{L−1} τ_f^ℓ represents a resolution-independent proxy for its representational capacity. Kondor et al. (2018) consider the separate fragments contained within f to subsume the traditional role of separate channels and therefore control the capacity of intermediary network representations through specification of τ_f. This is problematic because, whereas activation functions usually act on each channel separately and therefore have a cost that scales linearly with representational capacity (usually controlled by the number of channels), for the activation function N_⊗ not only does the cost scale quadratically with representational capacity τ̄_f, but so too does the size of N_⊗(f). This feeds forward the quadratic dependence to the cost of, and number of parameters required by, the proceeding generalized convolution.

More specifically, note that computation of g = N_⊗(f) requires the computation of Σ_{ℓ=0}^{L−1} τ_g^ℓ fragments, where τ_g^ℓ = Σ_{(ℓ₁,ℓ₂)∈P_L^ℓ} τ_f^ℓ₁ τ_f^ℓ₂. The size of P_L^ℓ is O(Lℓ) for each ℓ and therefore the expanded representation has size Σ_{ℓ=0}^{L−1} τ_g^ℓ of order O(τ̄_f² L³).

Figure 1: Comparison (to scale) of the expansion caused by tensor-product activations. Our multi-channel approach provides a K-times reduction in the cost of the non-linear operator N_⊗ and a K-times reduction in the number of parameters required by the proceeding generalized convolution for the same representational capacity. In practical applications K is considerably larger than shown.
By exploiting the sparsity of Clebsch-Gordan coefficients (C^{ℓ₁,ℓ₂,ℓ}_{m₁,m₂,m} = 0 if m₁ + m₂ ≠ m) each fragment (C^{ℓ₁,ℓ₂,ℓ})^⊤ (f̂_ℓ₁ ⊗ f̂_ℓ₂) can be computed in O(ℓ min(ℓ₁, ℓ₂)) operations. Hence, the total cost of computing all necessary fragments has complexity O(C²L⁵), where C = τ̄_f captures representational capacity.
3.1.1 CHANNEL-WISE TENSOR-PRODUCT ACTIVATIONS

As is more standard for CNNs we maintain a separate channels axis, with network activations taking the form (f₁, ..., f_K) ∈ F_L^K, where the f_i ∈ F_L all share the same type τ_f. The non-linearity N_⊗ may then be applied to each channel separately at a cost that is reduced K-times relative to its application on a single channel with the same total number of fragments. This saving arises since for each ℓ we need only compute K Σ_{(ℓ₁,ℓ₂)∈P_L^ℓ} τ_f^ℓ₁ τ_f^ℓ₂ fragments rather than Σ_{(ℓ₁,ℓ₂)∈P_L^ℓ} (Kτ_f^ℓ₁)(Kτ_f^ℓ₂). Figure 1 visualizes this reduction; in practical applications K is typically much larger. The K-times reduction in cost is substantial and allows for intermediary activations with orders of magnitude more representational capacity. By introducing this multi-channel approach and using the number of channels K, rather than τ̄_f alone, to control representational capacity C, we reduce the complexity of N_⊗ w.r.t. representational capacity from O(C²) to O(C).
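Counting fragments makes the K-times saving explicit (a sketch with our own names; the mixing-set helper restates Equation 6):

```python
def mixing_set(ell, L):
    """P_L^ell (Equation 6)."""
    return [(l1, l2) for l1 in range(L) for l2 in range(L)
            if abs(l1 - l2) <= ell <= l1 + l2]

def n_fragments(tau):
    """Total fragments produced by the tensor-product activation on type tau."""
    L = len(tau)
    return sum(tau[l1] * tau[l2]
               for ell in range(L) for l1, l2 in mixing_set(ell, L))

L, K = 8, 16
tau = [1] * L                                   # per-channel (spherical) type
flattened = n_fragments([K * t for t in tau])   # one monolithic representation
channel_wise = K * n_fragments(tau)             # N_tensor applied per channel
assert flattened == K * channel_wise            # exactly a K-times reduction
```

The identity holds for any type: flattening K channels into one representation squares the K factor, whereas the channel-wise activation keeps it linear.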
3.1.2 CONSTRAINED GENERALIZED CONVOLUTION

Although much reduced, for a signal f ∈ F_L^K_in the channel-wise application of N_⊗ still results in a drastically expanded representation g' = N_⊗(f), to which a representation-contracting generalized convolution must be applied in order to project onto a new activation g = L_G^(ψ)(g') ∈ F_L^K_out of the desired type τ_g and number of channels K_out. However, under our multi-channel structure computational and parameter efficiency can be improved significantly by decomposing L_G^(ψ) into three separate linear operators, L_G^(ψ₁), L_G^(ψ₂) and L_G^(ψ₃).

The first, L_G^(ψ₁), acts uniformly across channels, performing a linear projection down onto the desired type, interpreted as a learned extension of N_⊗ which undoes the drastic expansion. The second, L_G^(ψ₂), then acts channel-wise, taking linear combinations of the (contracted number of) fragments within each channel. The third, L_G^(ψ₃), acts across channels, taking linear combinations to learn new features. The filters take the form ψ₁ = {ψ̂₁^ℓ ∈ C^(τ_{g'}^ℓ × τ_g^ℓ) : ℓ = 0, ..., L−1}, ψ₂ = {ψ̂₂^{ℓ,k} ∈ C^(τ_g^ℓ × τ_g^ℓ) : ℓ = 0, ..., L−1; k = 1, ..., K_in} and ψ₃ = {ψ̂₃^ℓ ∈ C^(K_in × K_out) : ℓ = 0, ..., L−1}. By applying the first step uniformly across channels we minimize the parametric dependence on the expanded representation and allow new features to be subsequently learned much more efficiently. Together the second and third steps can be seen as analogous to depthwise separable convolutions often used in planar convolutional networks.
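A sketch of the three-step factorization, with our own naming, operating on the same stacked-fragment arrays as before:

```python
import numpy as np

def constrained_generalized_convolution(g, psi1, psi2, psi3):
    """Factorized contraction: shared projection, per-channel fragment mixing,
    then cross-channel mixing (cf. depthwise separable convolutions).

    g[k][ell]:    (2*ell+1, tau_expanded[ell]) fragments of input channel k.
    psi1[ell]:    (tau_expanded[ell], tau_out[ell]), shared by all channels.
    psi2[k][ell]: (tau_out[ell], tau_out[ell]), per channel.
    psi3[ell]:    (K_in, K_out), mixes channels degree by degree.
    """
    K_in, L = len(g), len(g[0])
    K_out = psi3[0].shape[1]
    # steps 1 and 2: project down onto the desired type, then mix fragments
    # within each channel
    h = [[g[k][l] @ psi1[l] @ psi2[k][l] for l in range(L)] for k in range(K_in)]
    # step 3: linear combinations across channels, degree by degree
    return [[sum(psi3[l][k, j] * h[k][l] for k in range(K_in)) for l in range(L)]
            for j in range(K_out)]

rng = np.random.default_rng(3)
L, K_in, K_out, tau_exp, tau_out = 3, 4, 2, 6, 2
crand = lambda *s: rng.normal(size=s) + 1j * rng.normal(size=s)
g = [[crand(2 * l + 1, tau_exp) for l in range(L)] for _ in range(K_in)]
psi1 = [crand(tau_exp, tau_out) for _ in range(L)]
psi2 = [[crand(tau_out, tau_out) for _ in range(L)] for _ in range(K_in)]
psi3 = [crand(K_in, K_out) for _ in range(L)]
out = constrained_generalized_convolution(g, psi1, psi2, psi3)
assert len(out) == K_out and out[0][2].shape == (5, tau_out)
```

Because ψ₁ is shared by all channels, the only parameter block that touches the large expanded type is of size τ_{g'}^ℓ × τ_g^ℓ per degree, independent of K_in, which is the source of the parameter saving described above.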
3.1.3 OPTIMIZED DEGREE MIXING SETS

We now consider approaches to reduce the O(L⁵) complexity w.r.t. spatial resolution L. In the definition of N_⊗ each element of P_L^ℓ independently defines an equivariant fragment. Therefore a restricted N_⊗ in which only a subset of P_L^ℓ is used for each degree ℓ still defines a strictly equivariant operator, while reducing computational complexity. In order to make savings whilst remaining at resolution L it is necessary to consider subsets of P_L^ℓ that scale better than O(L²). The challenge is to find such subsets that do not hamper the ability of the resulting operator to inject non-linearity and mix information corresponding to different degrees ℓ.

Figure 2: Visualization of (a) the full mixing set P_L^ℓ of size O(L²), (b) the minimum spanning tree (MST) subset of size O(L) and (c) the reduced minimum spanning tree (RMST) subset of size O(log L), which reduce the related computational costs accordingly.

The following argument motivates our approach. If (ℓ₁, ℓ₂) ∈ P_L^ℓ, then representation space is designated to capture the relationship between ℓ₁ and ℓ₂-degree information. However, if resources have already been designated to capture the relationship between ℓ₁ and ℓ₃-degrees, as well as between ℓ₃ and ℓ₂-degrees, then some notion of the relationship between ℓ₁ and ℓ₂-degrees has been captured already. Consequently, it is unnecessary to designate further resources for this purpose. More generally, consider the graph G_L^ℓ = (N_L, P_L^ℓ) with nodes N_L = {0, ..., L−1} and edges P_L^ℓ. A restricted tensor-product activation can be constructed by using a subset of P_L^ℓ that corresponds to a subgraph of G_L^ℓ.
The subgraph of G_L^ℓ captures some notion of the relationship between incoming ℓ₁ and ℓ₂-degree information if it contains a path between nodes ℓ₁ and ℓ₂. Therefore we are interested in subgraphs for which there exists a path between any two nodes if there exists such a path in the original graph, guaranteeing that any degree-mixing relationship captured by the original graph is also captured by the subgraph.

The smallest subgraph satisfying this property is a minimum spanning tree (MST) of G_L^ℓ. The set of edges corresponding to any MST has at most L elements and we choose to consider its union with the set of loop-edges in G_L^ℓ (of the form (ℓ₁, ℓ₁)), which proved particularly important for injecting non-linearity. We denote the resulting set as P̄_L^ℓ and note that it satisfies |P̄_L^ℓ| ≤ 2L. Therefore the tensor-product activation N̄_⊗, corresponding to Equation 6 with P_L^ℓ replaced by P̄_L^ℓ, has reduced spatial complexity O(L⁴). Given that many minimum spanning trees of the unweighted graph G_L^ℓ exist for each ℓ, we select the ones that minimize the cost of the resulting activation N̄_⊗ by assigning to each edge (ℓ₁, ℓ₂) in G_L^ℓ a weight equal to the cost of computing (C^{ℓ₁,ℓ₂,ℓ})^⊤ (f̂_ℓ₁ ⊗ f̂_ℓ₂) and selecting the MST of the weighted graph.
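A sketch of the MST-restricted mixing set using SciPy's spanning-tree routine (function names and the exact cost weights are ours; the weight is a proxy for the per-fragment cost discussed above):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def full_mixing_set(ell, L):
    """P_L^ell (Equation 6), as a set of degree pairs."""
    return {(l1, l2) for l1 in range(L) for l2 in range(L)
            if abs(l1 - l2) <= ell <= l1 + l2}

def mst_mixing_set(ell, L):
    """Loop edges plus a cost-minimal spanning tree of the mixing graph, so
    any two degrees connected in G_L^ell remain path-connected."""
    full = full_mixing_set(ell, L)
    W = np.zeros((L, L))
    for l1, l2 in full:
        if l1 != l2:
            # proxy cost of the fragment (C^{l1,l2,ell})^T (f_l1 x f_l2)
            W[l1, l2] = (2 * ell + 1) * (2 * min(l1, l2) + 1)
    tree = minimum_spanning_tree(W).tocoo()
    edges = {(int(i), int(j)) for i, j in zip(tree.row, tree.col)}
    loops = {(l, l) for l in range(L) if (l, l) in full}
    return edges | loops

L, ell = 16, 3
P_full, P_mst = full_mixing_set(ell, L), mst_mixing_set(ell, L)
assert P_mst <= P_full            # a subset, so strict equivariance is preserved
assert len(P_mst) <= 2 * L        # O(L) rather than O(L * ell) edges
assert len(P_mst) < len(P_full)
```

Because the restricted set is a subset of P_L^ℓ, every retained fragment is still strictly equivariant; the MST merely removes redundant degree pairs while keeping the mixing graph connected.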
An example of P_L^ℓ and an MST subset P̄_L^ℓ is shown in Figure 2, where the dashed line in Figure 2c shows the general form of the MST. Using this as a principled starting point, we consider the further reduced MST (RMST) subset P̃_L^ℓ corresponding to centering the MST at the edge (ℓ, ℓ) and retaining only the edges that fall a distance of 2^i away on the dotted line for some i ∈ N. We use Ñ_⊗ to denote the corresponding operator and note that it has further reduced spatial complexity of O(L³ log L). We demonstrate in Section 4 that networks that make use of the MST tensor-product activation achieve state-of-the-art performance. Replacing the MST with the RMST activation results in a small but insignificant degradation in performance, which is offset by the reduced computational cost.
3.2 EFFICIENT SAMPLING THEORY

We adopt the efficient sampling theorems on the sphere and rotation group of McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively, which reduce the Nyquist rate by a factor of two compared to those of Driscoll & Healy (1994) and Kostelec & Rockmore (2008), which have been adopted in other spherical CNN constructions (e.g. Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020). The sampling theorems adopted are equipped with fast algorithms to compute harmonic transforms, with complexity O(L³) for the sphere and O(L⁴) for the rotation group. When imposing an azimuthal bandlimit N ≪ L, the complexity of transforms on the rotation group can be reduced to O(NL³), which we often exploit in our standard (non-generalized) convolutional layers. By adopting sampling theorems on the sphere we provide access to underlying continuous signal representations that fully capture the symmetries and geometric properties of the sphere, and allow standard convolutions to be computed exactly and efficiently through their harmonic representations, as discussed in greater detail in Appendices A and B.

Table 1: Test accuracy for the spherical MNIST digit classification problem.

                        NR/NR   R/R     NR/R    Params
Planar CNN              99.32   90.74   11.36   58k
Cohen et al. (2018)     95.59   94.62   93.40   58k
Kondor et al. (2018)    96.40   96.60   96.00   286k
Esteves et al. (2020)   —       —       —       —
Ours                    —       —       —       —

Table 2: Test root mean squared (RMS) error for the QM7 regression problem.

                         RMS    Params
Montavon et al. (2012)   5.96   -
Cohen et al. (2018)      8.47   1.4M
Kondor et al. (2018)     7.97   —
Ours                     —      —

Table 3: SHREC'17 object retrieval competition metrics (perturbed micro-all).

                        P@N     R@N     F1@N    mAP     NDCG    Params
Kondor et al. (2018)    0.707   0.722   0.701   0.683   0.756   —
Ours                    —       —       —       —       —       500k
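The factor-of-two saving in the Nyquist rate translates directly into sample counts on the sphere. The sketch below illustrates the comparison; the Driscoll & Healy count follows their 2L × 2L equiangular grid, while the McEwen & Wiaux count is our approximation from the sampling-theorem literature and should be checked against the original papers.

```python
def driscoll_healy_samples(L):
    """Equiangular samples used by the Driscoll & Healy (1994) theorem: 2L x 2L."""
    return (2 * L) * (2 * L)

def mcewen_wiaux_samples(L):
    """Approximate sample count for the McEwen & Wiaux (2011) theorem:
    L - 1 rings of 2L - 1 points plus a single polar sample (our approximation)."""
    return (L - 1) * (2 * L - 1) + 1

ratio = driscoll_healy_samples(128) / mcewen_wiaux_samples(128)
assert 1.9 < ratio < 2.2   # roughly the factor-of-two Nyquist saving
```

Halving the number of samples halves the cost of every real-space operation (such as the pointwise activations of Section 2.5.1) at a given bandlimit.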
Using our efficient generalized spherical CNN framework we construct networks that we apply to a number of spherical benchmark problems (implemented in our fourpiAI software package). We achieve state-of-the-art performance, demonstrating the ability of our approach to enhance equivariance without compromising representational capacity or parameter efficiency. In all experiments we use a similar architecture, consisting of 2–3 standard convolutional layers (e.g. S² or SO(3) convolutions followed by ReLUs), followed by 2–3 of our efficient generalized layers. Full details may be found in Appendix E.

4.1 Rotated MNIST on the Sphere
We consider the now-standard benchmark problem of classifying MNIST digits projected onto the sphere. Three experimental modes, NR/NR, R/R and NR/R, are considered, indicating whether the training/test sets have been randomly rotated (R) or not (NR). Results are presented in Table 1, which shows that we closely match the prior state-of-the-art performance obtained by Esteves et al. (2020) on the NR/NR and R/R modes, whilst outperforming all previous spherical CNNs on the NR/R mode, demonstrating the increased degree of equivariance achieved by our model.

4.2 Atomization Energy Prediction
We consider the problem of regressing the atomization energy of molecules given the molecule's Coulomb matrix and the positions of the atoms in space, using the QM7 dataset (Blum & Reymond, 2009; Rupp et al., 2012). Results are presented in Table 2, which shows that we dramatically outperform other approaches, whilst using significantly fewer parameters.

4.3 3D Shape Retrieval
We consider the 3D shape retrieval problem on the SHREC'17 (Savva et al., 2017) competition dataset, containing 51k 3D object meshes. We follow the pre-processing step of Cohen et al. (2018), where several spherical projections of each mesh are computed, and use the official SHREC'17 data splits. Results are presented in Table 3 for the standard SHREC precision and recall metrics, which show that we achieve state-of-the-art performance compared to other spherical CNN approaches, achieving the highest score on three of the five performance metrics, whilst using significantly fewer parameters.

5 Conclusions
We have presented a generalized framework for CNNs on the sphere that encompasses various existing approaches. We developed new efficient layers to be used as primary building blocks in this framework by introducing a channel-wise structure, constrained generalized convolutions, and optimized degree mixing sets determined by minimum spanning trees. These new efficient layers exhibit strict rotational equivariance, without compromising on representational capacity or parameter efficiency. When combined with the flexibility of the generalized framework to leverage the strengths of alternative layers, powerful hybrid models can be constructed. On all spherical benchmark problems considered we achieve state-of-the-art performance, both in terms of accuracy and parameter efficiency. In future work we intend to improve the scalability of our generalized framework further still. In particular, we plan to introduce additional highly scalable layers, for example by extending scattering transforms (Mallat, 2012) to the sphere, to further realize the potential of deep learning on a host of new applications where spherical data are prevalent.

References
Lorenz Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131:8732, 2009.

Wouter Boomsma and Jes Frellsen. Spherical convolutions and their application in molecular modelling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 3433–3443. Curran Associates, Inc., 2017.

Taco Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1801.10130.

Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019. URL https://arxiv.org/abs/1902.04615.

James Driscoll and Dennis Healy. Computing Fourier transforms and convolutions on the sphere. Advances in Applied Mathematics, 15:202–250, 1994.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–68, 2018. URL https://arxiv.org/abs/1711.06721.

Carlos Esteves, Ameesh Makadia, and Kostas Daniilidis. Spin-weighted spherical CNNs. arXiv preprint arXiv:2006.10731, 2020. URL https://arxiv.org/abs/2006.10731.

Dennis Healy, Daniel Rockmore, Peter Kostelec, and S. Moore. FFTs for the 2-sphere – improvements and variations. Journal of Fourier Analysis and Applications, 9(4):341–385, 2003.

Chiyu Jiang, Jingwei Huang, Karthik Kashinath, Philip Marcus, Matthias Niessner, et al. Spherical CNNs on unstructured grids. arXiv preprint arXiv:1901.02039, 2019. URL https://arxiv.org/abs/1901.02039.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR: International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412.6980.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, pp. 2747–2755, 2018. URL https://arxiv.org/abs/1802.03690.

Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch–Gordan nets: a fully Fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems, pp. 10117–10126, 2018. URL https://arxiv.org/abs/1806.09231.

Peter Kostelec and Daniel Rockmore. FFTs on the rotation group. Journal of Fourier Analysis and Applications, 14:145–179, 2008.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012. URL https://arxiv.org/abs/1101.2286.

Domenico Marinucci and Giovanni Peccati. Random Fields on the Sphere: Representation, Limit Theorems and Cosmological Applications. Cambridge University Press, 2011.

Jason McEwen and Yves Wiaux. A novel sampling theorem on the sphere. IEEE Transactions on Signal Processing, 59(12):5876–5887, 2011. URL https://arxiv.org/abs/1110.6298.

Jason McEwen, Michael P. Hobson, Daniel J. Mortlock, and Anthony N. Lasenby. Fast directional continuous spherical wavelet transform algorithms. IEEE Transactions on Signal Processing, 55(2):520–529, 2007. URL https://arxiv.org/abs/astro-ph/0506308.

Jason McEwen, Pierre Vandergheynst, and Yves Wiaux. On the computation of directional scale-discretized wavelet transforms on the sphere. In Wavelets and Sparsity XV, SPIE International Symposium on Optics and Photonics, invited contribution, volume 8858, 2013. URL https://arxiv.org/abs/1308.5706.

Jason McEwen, Martin Büttner, Boris Leistedt, Hiranya V. Peiris, and Yves Wiaux. A novel sampling theorem on the rotation group. IEEE Signal Processing Letters, 22(12):2425–2429, 2015a. URL https://arxiv.org/abs/1508.03101.

Jason McEwen, Boris Leistedt, Martin Büttner, Hiranya Peiris, and Yves Wiaux. Directional spin wavelets on the sphere. IEEE Transactions on Signal Processing, submitted, 2015b. URL https://arxiv.org/abs/1509.06749.

Jason McEwen, Claudio Durastanti, and Yves Wiaux. Localisation of directional scale-discretised wavelets on the sphere. Applied and Computational Harmonic Analysis, 44(1):59–88, 2018. URL https://arxiv.org/abs/1509.06767.

Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 440–448. Curran Associates, Inc., 2012.

Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130–146, 2019. URL https://arxiv.org/abs/1810.12186.

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012. URL https://arxiv.org/abs/1109.2618.

Manolis Savva, Fisher Yu, Hao Su, Asako Kanezaki, Takahiko Furuya, Ryutarou Ohbuchi, Zhichao Zhou, Rui Yu, Song Bai, Xiang Bai, et al. Large-scale 3D shape retrieval from ShapeNet Core55: SHREC'17 track. In Proceedings of the Workshop on 3D Object Retrieval, pp. 39–50. Eurographics Association, 2017.

Max Tegmark. An icosahedron-based method for pixelizing the celestial sphere. Astrophysical Journal Letters, 470:L81, October 1996. URL https://arxiv.org/abs/astro-ph/9610094.

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018. URL https://arxiv.org/abs/1802.08219.

Stefano Trapani and Jorge Navaza. Calculation of spherical harmonics and Wigner d functions by FFT. Applications to fast rotational matching in molecular replacement and implementation into AMoRe. Acta Crystallographica Section A, 62(4):262–269, 2006.

Benjamin Wandelt and Krzysztof Górski. Fast convolution on the sphere. Physical Review D, 63(12):123002, 2001. URL https://arxiv.org/abs/astro-ph/0008227.
A Representations of Signals on the Sphere and Rotation Group
To provide further context for the discussion presented in the introduction, and to elucidate the properties of different sampling theory on the sphere and rotation group, we concisely review representations of signals on the sphere and rotation group.

A.1 Discretization
It is well known that a completely regular point distribution on the sphere does not in general exist (e.g. Tegmark, 1996). Consequently, while a variety of spherical discretization schemes exist (e.g. icosahedral, HEALPix, graph, and other representations), it is not possible to discretize (i.e. to sample or pixelize) the sphere in a manner that is invariant to rotations: a discrete set of rotations of the samples on the sphere will in general not map onto the same set of sample positions. This differs from the Euclidean setting and has important implications when constructing convolution operators on the sphere, which are clearly a critical component of CNNs.

Since convolution operators are in general built using a translation operator (equivalently, a rotation operator on the sphere), it is thus not possible to construct a convolution operator directly on a discretized representation of the sphere that captures all of the symmetries of the underlying spherical manifold. While approximate discrete representations can be considered, and are nevertheless useful, such representations cannot capture all underlying spherical symmetries.

A.2 Sampling Theory
Alternative representations, however, can capture all underlying spherical symmetries. Sampling theories on the sphere (e.g. Driscoll & Healy, 1994; McEwen & Wiaux, 2011) provide a mechanism to capture all information content of an underlying continuous function on the sphere from a finite set of samples (and similarly on the rotation group; Kostelec & Rockmore, 2008; McEwen et al., 2015a). A sampling theory on the sphere is equivalent to a cubature (i.e. quadrature) rule for the exact integration of bandlimited functions on the sphere. While optimal cubature on the sphere remains an open problem, the most efficient sampling theories on the sphere and rotation group are those developed by McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively.

On a compact manifold like the sphere (and rotation group), Fourier space is discrete. Hence, a finite set of Fourier coefficients captures all information content of an underlying continuous bandlimited signal. Since such a representation provides access to the underlying continuous signal, all symmetries and geometric properties of the sphere are captured perfectly. Such representations have been employed extensively in the construction of wavelet transforms on the sphere, where the use of sampling theorems on the sphere and rotation group yields wavelet transforms of discretized continuous signals that are theoretically exact (e.g. McEwen et al., 2013; 2015b; 2018). Harmonic signal representations have also been exploited in spherical CNNs to access all underlying spherical symmetries and develop equivariant network layers (Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020).
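The way a discrete set of samples can capture a continuous bandlimited signal exactly is easy to illustrate in one dimension. The sketch below is our own toy analogue on the circle (not a spherical sampling theorem): a signal bandlimited at K is sampled at N ≥ 2K + 1 equispaced points, its Fourier coefficients are recovered exactly with an FFT, and the underlying continuous signal is then evaluated at an arbitrary off-grid point.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 5, 16                      # bandlimit K; N >= 2K + 1 equispaced samples

# A real continuous signal on the circle, bandlimited at K.
a = rng.standard_normal(K + 1)    # cosine coefficients (a[0] is the mean)
b = rng.standard_normal(K + 1)    # sine coefficients (b[0] multiplies sin(0) = 0)

def f(t):
    # Evaluate the underlying continuous signal at arbitrary points t.
    t = np.atleast_1d(t)
    k = np.arange(K + 1)[:, None]
    return (a[:, None] * np.cos(k * t) + b[:, None] * np.sin(k * t)).sum(axis=0)

# Sample on the regular grid and recover the Fourier coefficients exactly.
t_samp = 2 * np.pi * np.arange(N) / N
F = np.fft.fft(f(t_samp)) / N     # coefficient c_k stored at index k mod N

def reconstruct(t):
    # Trigonometric interpolation: evaluate the continuous signal off-grid
    # purely from the finite set of samples.
    ks = np.concatenate([np.arange(K + 1), np.arange(-K, 0)])
    return float(np.real(np.sum(F[ks % N] * np.exp(1j * ks * t))))
```

Since Fourier space of the bandlimited signal is finite, the reconstruction agrees with the continuous signal to machine precision at any point, which is the discrete-to-continuous correspondence that spherical sampling theorems provide on S² and SO(3).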
A.3 Exact and Efficient Computation
Signals on the sphere f ∈ L²(S²) may be decomposed into their harmonic representations as

f(\omega) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} f_{\ell m} Y_{\ell m}(\omega),   (7)

where their spherical harmonic coefficients are given by

f_{\ell m} = \langle f, Y_{\ell m} \rangle = \int_{S^2} d\mu(\omega)\, f(\omega)\, Y^*_{\ell m}(\omega),   (8)

for ω ∈ S². Similarly, signals on the rotation group g ∈ L²(SO(3)) may be decomposed into their harmonic representations as

g(\rho) = \sum_{\ell=0}^{\infty} \frac{2\ell+1}{8\pi^2} \sum_{m=-\ell}^{\ell} \sum_{n=-\ell}^{\ell} g^\ell_{mn} D^{\ell *}_{mn}(\rho),   (9)

where their harmonic (Wigner) coefficients are given by

g^\ell_{mn} = \langle g, D^{\ell *}_{mn} \rangle = \int_{SO(3)} d\mu(\rho)\, g(\rho)\, D^\ell_{mn}(\rho),   (10)

for ρ ∈ SO(3). Note that we adopt the convention where the conjugate of the Wigner D-function is used in Equation 9, since this leads to a convenient harmonic representation when considering convolutions (cf. McEwen et al., 2015a; 2018).

As mentioned above, sampling theory pertains to strategies to capture all of the information content of bandlimited signals from a finite set of samples. Since the harmonic space of the sphere and rotation group is discrete, this is equivalent to an exact quadrature rule for the computation of Fourier coefficients by Equation 8 and Equation 10 from sampled signals.

The canonical equiangular sampling theory on the sphere was that developed by Driscoll & Healy (1994), and subsequently extended to the rotation group by Kostelec & Rockmore (2008). More recently, novel sampling theorems on the sphere and rotation group were developed by McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively, that reduce the Nyquist rate by a factor of two. Previous CNN constructions on the sphere (e.g. Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020) have adopted the more well-known sampling theories of Driscoll & Healy (1994) and Kostelec & Rockmore (2008).
In contrast, we adopt the more efficient sampling theories of McEwen & Wiaux (2011) and McEwen et al. (2015a) to provide additional efficiency savings, implemented in the open-source ssht and so3 software packages (we also make use of a TensorFlow implementation of these algorithms in our private tensossht code, available on request). Note also that the sampling schemes associated with the theory of McEwen & Wiaux (2011) (and other minor variants implemented in ssht) align more closely with the one-to-two aspect ratio of common spherical data, such as 360° photos and videos.

All of the sampling theories discussed are equipped with fast algorithms to compute Fourier transforms, with complexity O(L³) for transforms on the sphere (Driscoll & Healy, 1994; McEwen & Wiaux, 2011) and complexity O(L⁴) for transforms on the rotation group (Kostelec & Rockmore, 2008; McEwen et al., 2015a). Note that algorithms that achieve slightly lower complexity have been developed (Driscoll & Healy, 1994; Healy et al., 2003; Kostelec & Rockmore, 2008) but these are known to suffer stability issues (Healy et al., 2003; Kostelec & Rockmore, 2008). By imposing an azimuthal bandlimit N, where typically N ≪ L, the complexity of transforms on the rotation group can be reduced to O(NL³) (McEwen et al., 2015a), which we exploit in our networks.

These fast algorithms to compute Fourier transforms on the sphere and rotation group can be leveraged to yield the exact and efficient computation of convolutions through their harmonic representations (see Appendix B). By computing convolutions in harmonic space, pixelization and quadrature errors are avoided and computational complexity is reduced to the cost of the respective Fourier transforms.
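To make the quadrature view concrete, the following sketch evaluates spherical harmonic coefficients by Equation 8 using Gauss–Legendre nodes in cos θ with equispaced azimuthal samples. This is an illustrative scheme of our own (the equiangular theorems adopted in the paper use different sample placements); for bandlimited integrands the quadrature is nevertheless exact. The Ylm helper is our own construction from associated Legendre functions, for m ≥ 0 only.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def Ylm(l, m, theta, phi):
    # Spherical harmonic Y_lm (m >= 0) from associated Legendre functions,
    # with the normalization of Equation 32.
    Nlm = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - m) / factorial(l + m))
    return Nlm * lpmv(m, l, np.cos(theta)) * np.exp(1j * m * phi)

L = 8                                            # bandlimit
x, w = np.polynomial.legendre.leggauss(L)        # Gauss-Legendre nodes in cos(theta)
theta = np.arccos(x)
phi = 2 * np.pi * np.arange(2 * L) / (2 * L)     # equispaced azimuthal samples
T, P = np.meshgrid(theta, phi, indexing="ij")

def flm(f_samples, l, m):
    # <f, Y_lm> computed by quadrature (Equation 8); exact for products of
    # harmonics up to the bandlimit.
    integrand = f_samples * np.conj(Ylm(l, m, T, P))
    return (2 * np.pi / (2 * L)) * np.sum(w[:, None] * integrand)

f_samples = Ylm(2, 1, T, P)                      # test signal f = Y_{2,1}
```

For f = Y_{2,1}, the quadrature returns ⟨Y_{2,1}, Y_{2,1}⟩ = 1 and ⟨Y_{2,1}, Y_{3,1}⟩ = 0 to machine precision, illustrating how a finite sample set gives exact access to the harmonic representation.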
ONVOLUTION ON THE S PHERE AND R OTATION G ROUP
For completeness we make explicit the standard (non-generalized) convolution operations on thesphere and rotation group that we adopt. The general form of convolution for signals f P L p Ω q either on the sphere ( Ω “ S ) or rotation group ( Ω “ SO p q ) is specified by Equation 1, with har-monic representation given by Equation 2. Here we provide specific expressions for the convolutionfor a variety of cases, describe the normalization constants that arise and may be absorbed into learn-able filters, and derive the corresponding harmonic forms. In practice all convolutions are computedin harmonic space since the computation is then exact, avoiding pixelisation or quadrature errors,and efficient when fast algorithms to compute harmonic transforms are exploited (see Appendix A). Available on request from . ONVOLUTION ON THE S PHERE
Given two spherical signals f, ψ ∈ L²(S²), their convolution, which in general is a signal on the rotation group, may be decomposed as

(f \star \psi)(\rho) = \langle f, R_\rho \psi \rangle   (11)
  = \int_{S^2} d\Omega(\omega)\, f(\omega)\, \psi^*(\rho^{-1}\omega)   (12)
  = \sum_{\ell m} \sum_{\ell' m'} \sum_{n'} f_{\ell m} D^{\ell' *}_{m' n'}(\rho) \psi^*_{\ell' n'} \int_{S^2} d\Omega(\omega)\, Y_{\ell m}(\omega) Y^*_{\ell' m'}(\omega)   (13)
  = \sum_{\ell m} \sum_{\ell' m'} \sum_{n'} f_{\ell m} D^{\ell' *}_{m' n'}(\rho) \psi^*_{\ell' n'} \delta_{\ell \ell'} \delta_{m m'}   (14)
  = \sum_{\ell m n} \big( f_{\ell m} \psi^*_{\ell n} \big) D^{\ell *}_{m n}(\rho),   (15)

yielding harmonic coefficients

(f \star \psi)^\ell_{mn} = \frac{8\pi^2}{2\ell+1} f_{\ell m} \psi^*_{\ell n}.   (16)

The constants 8π²/(2ℓ+1) may be absorbed into learnable parameters.

B.2 Convolution on the Sphere with Axisymmetric Filters
When convolving a spherical signal f ∈ L²(S²) with an axisymmetric spherical filter ψ ∈ L²(S²) that is invariant under azimuthal rotations, the resultant f ⋆ ψ may be interpreted as a signal on the sphere. To see this, note that an axisymmetric filter ψ has Fourier coefficients ψ_{ℓm} = ψ_{ℓ0} δ_{m0} that are non-zero only for m = 0. Denoting rotations by their zyz-Euler angles ρ = (α, β, γ) and substituting into Equation 15, we see that the convolution may be decomposed as

(f \star \psi)(\alpha, \beta, \gamma) = \sum_{\ell m n} \big( f_{\ell m} \psi^*_{\ell 0} \delta_{n 0} \big) D^{\ell *}_{m n}(\alpha, \beta, \gamma)   (17)
  = \sum_{\ell m} f_{\ell m} \psi^*_{\ell 0} D^{\ell *}_{m 0}(\alpha, \beta, 0)   (18)
  = \sum_{\ell m} f_{\ell m} \psi^*_{\ell 0} \sqrt{\frac{4\pi}{2\ell+1}}\, Y_{\ell m}(\beta, \alpha).   (19)

We may therefore interpret f ⋆ ψ as a signal on the sphere with spherical harmonic coefficients

(f \star \psi)_{\ell m} = \sqrt{\frac{4\pi}{2\ell+1}}\, f_{\ell m} \psi^*_{\ell 0}.   (20)

The constants \sqrt{4\pi/(2\ell+1)} may be absorbed into learnable parameters.

B.3 Convolution on the Rotation Group
Given two signals f, ψ ∈ L²(SO(3)) on the rotation group, their convolution may be decomposed as

(f \star \psi)(\rho) = \langle f, R_\rho \psi \rangle   (21)
  = \int_{SO(3)} d\mu(\rho')\, f(\rho')\, \psi^*(\rho^{-1}\rho')   (22)
  = \int_{SO(3)} d\mu(\rho') \Big[ \sum_{\ell} \frac{2\ell+1}{8\pi^2} \sum_{mn} f^\ell_{mn} D^{\ell *}_{mn}(\rho') \Big] \Big[ \sum_{\ell'} \frac{2\ell'+1}{8\pi^2} \sum_{m'n'} \psi^{\ell' *}_{m'n'} D^{\ell'}_{m'n'}(\rho^{-1}\rho') \Big]   (23)
  = \sum_{\ell} \frac{2\ell+1}{8\pi^2} \sum_{mn} f^\ell_{mn} \sum_{\ell'} \frac{2\ell'+1}{8\pi^2} \sum_{m'n'} \psi^{\ell' *}_{m'n'} \int_{SO(3)} d\mu(\rho')\, D^{\ell *}_{mn}(\rho') D^{\ell'}_{m'n'}(\rho^{-1}\rho')   (24)
  = \sum_{\ell} \frac{2\ell+1}{8\pi^2} \sum_{mn} f^\ell_{mn} \sum_{\ell'} \frac{2\ell'+1}{8\pi^2} \sum_{m'n'} \psi^{\ell' *}_{m'n'} \int_{SO(3)} d\mu(\rho')\, D^{\ell *}_{mn}(\rho') \sum_k D^{\ell' *}_{k m'}(\rho) D^{\ell'}_{k n'}(\rho')   (25)
  = \sum_{\ell} \frac{2\ell+1}{8\pi^2} \sum_{mn} f^\ell_{mn} \sum_{\ell'} \frac{2\ell'+1}{8\pi^2} \sum_{m'n'} \psi^{\ell' *}_{m'n'} \sum_k D^{\ell' *}_{k m'}(\rho) \frac{8\pi^2}{2\ell+1} \delta_{\ell\ell'} \delta_{mk} \delta_{nn'}   (26)
  = \sum_{\ell m m'} \frac{2\ell+1}{8\pi^2} D^{\ell *}_{m m'}(\rho) \Big( \sum_n f^\ell_{mn} \psi^{\ell *}_{m'n} \Big),   (27)

where for Equation 25 we make use of the relation (e.g. Marinucci & Peccati, 2011; McEwen et al., 2018)

D^\ell_{mn}(\rho^{-1}\rho') = \sum_k D^{\ell *}_{km}(\rho) D^\ell_{kn}(\rho').   (28)

This decomposition yields harmonic coefficients

(f \star \psi)^\ell_{mn} = \sum_{m'} f^\ell_{m m'} \psi^{\ell *}_{n m'}.   (29)
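The harmonic forms of Equations 16 and 29 reduce convolution to independent operations per degree ℓ. A minimal numpy sketch (our own illustration; a coefficient layout is assumed, with m and n stored as offsets 0, ..., 2ℓ within each degree):

```python
import numpy as np

def sphere_convolve(f, psi):
    # Equation 16 (up to the absorbable constant 8*pi^2/(2l+1)):
    # (f * psi)^l_{mn} = f_{lm} conj(psi)_{ln} -- an outer product per degree l.
    # f, psi: lists over degree l of length-(2l+1) coefficient vectors.
    return [np.outer(fl, np.conj(pl)) for fl, pl in zip(f, psi)]

def so3_convolve(f, psi):
    # Equation 29: (f * psi)^l_{mn} = sum_{m'} f^l_{m m'} conj(psi)^l_{n m'},
    # i.e. a per-degree matrix product f^l (psi^l)^H.
    # f, psi: lists over degree l of (2l+1, 2l+1) Wigner coefficient matrices.
    return [fl @ pl.conj().T for fl, pl in zip(f, psi)]
```

Summing the per-degree matrix-product cost (2ℓ + 1)³ over ℓ < L gives O(L⁴) operations for the rotation-group product, so the overall cost of a harmonic-space convolution is dominated by the harmonic transforms themselves.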
C Filters on the Sphere and Rotation Group
When defining filters we look to encode desirable real-space properties, such as locality and regularity. In practice, however, considerable computation may be saved by defining the filters directly in harmonic space, saving the cost of Fourier transforming ahead of harmonic-space convolutions. We describe here how filters motivated by their real-space properties may be defined directly in harmonic space.

C.1 Dirac Delta Filters on the Sphere
Spherical filters may be constructed as a weighted sum of Dirac delta functions on the sphere. This construction is useful as the harmonic representation has an analytic form that may be computed efficiently. Furthermore, various real-space properties can be encoded through sensible placement of the Dirac delta functions.

The spherical Dirac delta function δ_{ω₀} centered at ω₀ = (θ₀, φ₀) ∈ S² is defined as

\delta_{\omega_0}(\omega) = \delta_R(\cos\theta - \cos\theta_0)\, \delta_R(\phi - \phi_0),   (30)

where δ_R is the familiar Dirac delta function on the reals centered at zero. The Dirac delta on the sphere may be represented in harmonic space by

(\delta_{\omega_0})_{\ell m} = Y^*_{\ell m}(\omega_0) = N_{\ell m} P_{\ell m}(\cos\theta_0)\, e^{-i m \phi_0},   (31)

which follows from the sifting property of the Dirac delta, and where Y_{ℓm} denote the spherical harmonic functions, P_{ℓm}(x) are the associated Legendre functions, and

N_{\ell m} = \sqrt{\frac{2\ell+1}{4\pi} \frac{(\ell-m)!}{(\ell+m)!}}   (32)

is a normalizing constant.

This representation may then be used to define a filter ψ ∈ L²(S²) as a weighted sum of spherical Dirac delta functions, with weights w_{ij} assigned to Dirac delta functions centered at points {(θ_i, φ_j) : i = 1, ..., N_θ; j = 1, ..., N_φ}. The associated harmonic-space representation is given by

\psi_{\ell m} = \sum_{i,j} w_{ij} N_{\ell m} P_{\ell m}(\cos\theta_i)\, e^{-i m \phi_j}   (33)
  = \sum_i N_{\ell m} P_{\ell m}(\cos\theta_i) \sum_j w_{ij}\, e^{-i m \phi_j},   (34)

where fast Fourier transforms may be leveraged to compute the inner sum if the Dirac deltas are spaced evenly azimuthally (e.g. if φ_j = 2πj/N_φ). Alternative arbitrary samplings can of course be considered if useful for a problem at hand.

When defining filters in this manner one should be careful not to over-parametrize by assigning more weights than needed to define a filter at the harmonic bandlimit of the signal with which we wish to convolve.
For example, if the filter is to be convolved with a signal bandlimited at L, then a maximum of 2L − 1 Dirac deltas should be placed along each ring of constant θ. One may also choose to interpolate the weights from a smaller number of learnable parameters acting as anchor points, allowing higher-resolution filters to be defined with fewer learnable parameters.
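The FFT evaluation of the inner azimuthal sum in Equation 34 can be sketched as follows (our own illustration, restricted to m ≥ 0 for brevity), cross-checked against the direct double sum of Equation 33:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def dirac_filter_coeff(w, theta, l, m):
    # psi_{lm} for a filter built from weighted Dirac deltas at (theta_i, phi_j),
    # with phi_j = 2*pi*j/N_phi (Equation 34, m >= 0).
    n_phi = w.shape[1]
    inner = np.fft.fft(w, axis=1)[:, m % n_phi]   # sum_j w_ij exp(-i m phi_j)
    Nlm = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - m) / factorial(l + m))
    return np.sum(Nlm * lpmv(m, l, np.cos(theta)) * inner)

# Cross-check against the direct double sum (Equation 33).
rng = np.random.default_rng(0)
n_theta, n_phi, l, m = 3, 8, 4, 2
w = rng.standard_normal((n_theta, n_phi))         # learnable weights w_ij
theta = np.array([0.4, 1.1, 2.0])                 # ring colatitudes theta_i
phi = 2 * np.pi * np.arange(n_phi) / n_phi
Nlm = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - m) / factorial(l + m))
direct = sum(
    w[i, j] * Nlm * lpmv(m, l, np.cos(theta[i])) * np.exp(-1j * m * phi[j])
    for i in range(n_theta) for j in range(n_phi)
)
```

The FFT route and the direct sum agree to machine precision; the FFT version computes all azimuthal frequencies of a ring at once, which is what makes evaluating the full set of ψ_{ℓm} cheap.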
C.2 Dirac Delta Filters on the Rotation Group

Similarly, a Dirac delta function δ_{ρ₀} on the rotation group SO(3) centered at position ρ₀ = (α₀, β₀, γ₀) ∈ SO(3) is defined as

\delta_{\rho_0}(\rho) = \delta_R(\alpha - \alpha_0)\, \delta_R(\cos\beta - \cos\beta_0)\, \delta_R(\gamma - \gamma_0),   (35)

with harmonic form

(\delta_{\rho_0})^\ell_{mn} = D^\ell_{mn}(\rho_0) = e^{-i m \alpha_0}\, d^\ell_{mn}(\beta_0)\, e^{-i n \gamma_0},   (36)

where d^ℓ_{mn} are the Wigner (small) d-matrices.

The filter ψ ∈ L²(SO(3)) corresponding to a weighted sum of Dirac deltas, with weights w_{ijk} assigned to Dirac delta functions centered at points {(α_i, β_j, γ_k) : i = 1, ..., N_α; j = 1, ..., N_β; k = 1, ..., N_γ}, has harmonic form

\psi^\ell_{mn} = \sum_{i,j,k} w_{ijk}\, e^{-i m \alpha_i}\, d^\ell_{mn}(\beta_j)\, e^{-i n \gamma_k}   (37)
  = \sum_j d^\ell_{mn}(\beta_j) \sum_i e^{-i m \alpha_i} \sum_k w_{ijk}\, e^{-i n \gamma_k},   (38)

where again fast Fourier transforms may be leveraged to compute the inner two sums, assuming the Dirac deltas are spaced evenly in α and γ. The outer sums of Equation 34 and Equation 38 can also be computed by fast Fourier transforms by decomposing the Wigner d-matrices into their Fourier representation (cf. Trapani & Navaza, 2006; McEwen & Wiaux, 2011). One should again be careful not to over-parametrize.

D Equivariance Tests
To test the rotational equivariance of operators we consider N_f random signals {f_i}_{i=1}^{N_f} in L²(Ω), with harmonic coefficients sampled from the standard normal distribution, and N_ρ rotations {ρ_j}_{j=1}^{N_ρ} sampled uniformly on SO(3). In order to measure the extent to which an operator A : L²(Ω) → L²(Ω) is equivariant, we evaluate the mean relative error

d\big(\mathcal{A}(R_{\rho_j} f_i),\, R_{\rho_j}(\mathcal{A} f_i)\big) = \frac{1}{N_f N_\rho} \sum_{i=1}^{N_f} \sum_{j=1}^{N_\rho} \frac{\| \mathcal{A}(R_{\rho_j} f_i) - R_{\rho_j}(\mathcal{A} f_i) \|}{\| \mathcal{A}(R_{\rho_j} f_i) \|}   (39)

resulting from pre-rotation of the signal, followed by application of A, as opposed to post-rotation after application of A, where the norm ‖·‖ is defined using the inner product ⟨·, ·⟩ on L²(Ω).

Table 4 presents the mean relative equivariance errors computed. We consider the three standard convolutions described in Appendix B (with a random filter ψ_i for each signal f_i, generated in the same manner as f_i), the pointwise ReLU activation described in Section 2.5.1 for signals on the sphere (Ω = S²) and rotation group (Ω = SO(3)), and the composition of the tensor-product activation with a generalized convolution, described in Sections 2.5.2 and 2.4, respectively. We follow the tensor-product activation with a generalized convolution in order to project down onto the sphere, allowing the same notion of error to be adopted as for the other operators. For consistency with the context in which we leverage these operators, all experiments are performed using single-precision arithmetic.

We see that the three standard notions of convolution and the composition of the tensor-product activation and generalized convolution are all strictly equivariant to floating-point machine precision. The pointwise ReLU operator is not strictly equivariant, exhibiting substantially larger mean relative errors for signals on both the rotation group and the sphere.
These errors reduce when the signals are oversampled before application of the ReLU, indicating that the error is due to aliasing induced by the spreading of information to higher degrees that are not captured at the original bandlimit. For example, for the pointwise ReLU operator on the rotation group, oversampling by increasing factors results in a progressive reduction in the mean relative equivariance error.

Table 4: Layer equivariance tests

Layer                                         | Mean relative error
S² to S² conv.                                |
S² to SO(3) conv.                             |
SO(3) to SO(3) conv.                          |
Tensor-product activation → generalized conv. |
S² ReLU (with and without oversampling)       |
SO(3) ReLU (with and without oversampling)    |
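The aliasing mechanism, and the mean relative error of Equation 39, can be illustrated with a simple 1D periodic analogue (our own toy example, not the spherical experiment itself): continuous translations of a bandlimited signal play the role of rotations, and applying the ReLU at a higher sampling resolution before re-bandlimiting reduces the equivariance error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 64, 24                                   # grid size, signal bandlimit (K < N/2)

def shift(x, s):
    # Exact continuous translation (by fraction s of the period) of a sampled
    # bandlimited periodic signal, via a Fourier phase ramp.
    k = np.fft.fftfreq(len(x)) * len(x)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * k * s)))

def relu_oversampled(x, q):
    # Upsample by q (zero-pad the spectrum), apply ReLU, then re-bandlimit.
    n = len(x)
    X = np.fft.fft(x)
    G = np.zeros(n * q, dtype=complex)
    G[: n // 2], G[-(n // 2):] = X[: n // 2], X[-(n // 2):]
    g = np.maximum(np.real(np.fft.ifft(G)) * q, 0.0)
    H = np.fft.fft(g) / q
    return np.real(np.fft.ifft(np.concatenate([H[: n // 2], H[-(n // 2):]])))

# Random real signal bandlimited at K.
c = rng.standard_normal(K) + 1j * rng.standard_normal(K)
F = np.zeros(N, dtype=complex)
F[1 : K + 1] = c
F[-K:] = np.conj(c[::-1])
f = np.real(np.fft.ifft(F))

def mean_relative_error(q, shifts=(0.21, 0.37, 0.55)):
    # 1D analogue of the equivariance metric in Equation 39.
    errs = []
    for s in shifts:
        a = relu_oversampled(shift(f, s), q)    # operate after "rotation"
        b = shift(relu_oversampled(f, q), s)    # "rotate" after operating
        errs.append(np.linalg.norm(a - b) / np.linalg.norm(a))
    return float(np.mean(errs))
```

The error at oversampling factor 4 is smaller than at factor 1, mirroring the behaviour reported for the spherical ReLU: the nonlinearity spreads energy to frequencies above the original bandlimit, and sampling more finely before projecting back captures that energy instead of aliasing it.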
E Additional Information on Experiments
E.1 Rotated MNIST on the Sphere
For our MNIST experiments we used a hybrid model with the architecture shown in Figure 3. The first block includes a directional convolution on the sphere that lifts the spherical input (τ_ℓ = 1) onto the rotation group (τ_ℓ = min(2ℓ+1, 2N−1)). The second block includes a convolution on the rotation group, hence its input and output both live on the rotation group. We then apply a restricted generalized convolution to map to type τ_ℓ = ⌈τ_max/√(2ℓ+1)⌉. The same type is used for the following three channel-wise tensor-product activations and two restricted generalized convolutions, until the final restricted generalized convolution maps down to a rotationally invariant representation (τ_ℓ = δ_{ℓ0}). As is traditional in convolutional networks, we gradually decrease the resolution via the bandlimits (L₀, ..., L₅) and increase the number of channels (K₀, ..., K₅).

Figure 3: Visualization of the architecture used for the convolutional base in our hybrid models. The input to the first convolutional layer is a signal on the sphere. The output from the final convolutional layer consists of scalar values corresponding to fragments of degree ℓ = 0, which are then mapped through fully connected layers to give the model output.

We follow these convolutional layers with a single dense layer, sandwiched between two dropout layers (keep probability 0.5), and then fully connect to the output of size 10. We train the network using the Adam optimizer (Kingma & Ba, 2015) with a decaying learning rate. For the restricted generalized convolutions we follow the approach of Kondor et al. (2018) by using L2 regularization and applying a restricted batch normalization across fragments, where the fragments are only scaled by their average and not translated (to preserve equivariance).

E.2 Atomization Energy Prediction
When regressing the atomization energy of molecules there are two inputs to the model: the number of atoms of each element contained in the molecule, and spherical cross-sections of the potential energy around each atom. We adopt the high-level QM7-specific architecture of Cohen et al. (2018), which contains a spherical CNN as a sub-model, for which we substitute our own. This results in an overall model that is invariant both to rotations of the molecule around each constituent atom and to permutations of the ordering of the atoms.

The first (non-spherical) input is mapped onto a scalar output using a multi-layer perceptron (MLP) with three hidden layers (and ReLU activations). The second input, multiple spherical cross-sections for each atom, is separately projected using a shared spherical CNN (of the architecture described below) onto lower-dimensional vectors. The mean vector is then taken across atoms (ensuring invariance with respect to permutations of the atoms) and mapped onto a scalar output using an MLP with a single hidden layer (with a ReLU activation). The predicted energy is then taken to be the sum of the two scalar outputs.

As a starting point we train the first MLP to regress the atomization energies alone, before pairing it with the spherical model (and its connected MLP). We then train the joint model for 60 epochs, again with the Adam optimizer and a decaying learning rate, regularizing the efficient generalized layers with L2 regularization.

For the spherical component we again adopt the convolutional architecture shown in Figure 3, except with one fewer efficient generalized layer. One minor difference is that this time we include a skip connection between the ℓ = 0 components of the fourth and fifth layers.
We follow the convolutional layers with two dense layers and use batch normalization between each layer.

E.3 3D Shape Retrieval
To project the 3D meshes of the SHREC'17 data onto bandlimited spherical representations, we adopt the preprocessing approach of Cohen et al. (2018) and augment the data with random rotations and translations.

We construct a model with an architecture that is again similar to that described in Appendix E.1, but with an additional axisymmetric convolutional layer prepended to the start of the network and one fewer efficient generalized layer. The convolutional layers are followed by a dense layer which is fully connected to the output (of size 55).

We again train with the Adam optimizer and a decaying learning rate, this time until performance on the validation set showed no improvement for a number of epochs. We perform batch normalization between convolutional layers and dropout preceding the dense layer. We regularize the efficient generalized layers with L2 regularization (strength 10⁻⁵).