Embed Me If You Can: A Geometric Perceptron
Pavlo Melnyk, Michael Felsberg, Mårten Wadenbäck
Computer Vision Laboratory, Department of Electrical Engineering, Linköping University
{pavlo.melnyk, michael.felsberg, marten.wadenback}@liu.se
Abstract
Solving geometric tasks using machine learning is a challenging problem. Standard feed-forward neural networks combine linear or, if the bias parameter is included, affine layers and activation functions. Their geometric modeling is limited, which is why we introduce the alternative model of the multilayer geometric perceptron (MLGP) with units that are geometric neurons, i.e., combinations of hypersphere neurons. The hypersphere neuron is obtained by applying a conformal embedding of Euclidean space. By virtue of Clifford algebra, it can be implemented as the Cartesian dot product. We validate our method on the public 3D Tetris dataset consisting of coordinates of geometric shapes, and we show that our method has the capability of generalization over geometric transformations. We demonstrate that our model is superior to the vanilla multilayer perceptron (MLP) while having fewer parameters and no activation function in the hidden layers other than the embedding. In the presence of noise in the data, our model is also superior to the multilayer hypersphere perceptron (MLHP) proposed in prior work. In contrast to the latter, our method reflects the 3D-geometry and provides a topological interpretation of the learned coefficients in the geometric neurons.
Geometric deep learning [1] starts with the geometry of a neuron. A perceptron model representing a hyperplane is, in this sense, a geometric entity of Euclidean or affine space. Geometries beyond hyperplanes such as hyperspheres and curved manifolds require more complex models. The benefits of increasing the complexity of decision surfaces have been previously demonstrated in [2–4]. To construct such surfaces, one needs to consider more general spaces, in which case Klein geometries [5] come in handy.

The abstract concepts of Klein geometries can efficiently be formulated in the framework of Clifford (geometric) algebras [6]. It has therefore also been applied in neurocomputing in the past [7–12]. A Clifford algebra can be viewed as a generalization of quaternions and complex numbers. This versatility, along with the geometric nature, is one of the main reasons for the renewed interest it has recently gained, e.g., in the context of complex-valued models [13, 14].

In this paper, we employ a Clifford algebra as a tool to perform computations in conformal geometry. Building on top of the previous works on Clifford neurons [15], spherical decision surfaces [16], and multilayer hypersphere perceptrons (MLHP) [17], we use the conformal embedding in Minkowski space and propose an alternative conformal model called the multilayer geometric perceptron (MLGP) whose units are based on hypersphere neurons. Owing to the homogeneous representation [18], we provide an interpretation of the model parameters directly in the Euclidean space for an 8-class geometry classification problem. This goes significantly beyond prior work that offers an interpretation based on distances to at most three classes. We show that our MLGP model is superior to the baseline MLHP and an analogous vanilla multilayer perceptron (MLP) by comparing their performance, thus presenting the benefits of utilizing geometric algebra in neural networks.
Preprint. Under review.
Figure 1: The proposed MLGP (conformal) model; embedding is performed at each layer vector-wise: the first embedding term (yellow) is always set to −1, the second (blue) is the scaled magnitude of the original vector, i.e., −½‖x‖²; since the embedding of the input matrix is row-wise, the first hidden layer units represent linear combinations of hyperspheres.

We summarize our contributions as follows: (a) We introduce the MLGP model whose units are geometric (conformal) neurons based on hypersphere neurons, where we use the geometric embedding instead of an activation function; we demonstrate that our model is superior to the vanilla MLP model and, in the presence of noise in the data, to the baseline. (b) Using separate ranges of the rotation angle to create training and test sets, we show that our MLGP model generalizes over rigid body transformations, whereas the vanilla MLP has a significant drop in generalization performance. (c) We give a topological interpretation of the learned model parameters: the idea of inverted decision surfaces grounded in the homogeneous representation of hyperspheres, demonstrated in the Euclidean R^3 space.

Research on neural networks equivariant to certain symmetry groups has been expanding over the past few years, e.g., SE(3)-equivariant models [19], [20], and the SO(3)-equivariant network [21]. A Klein geometry can be viewed as a homogeneous space together with a transformation (symmetry) group acting on it. Conformal geometry on the sphere is modeled as a Klein geometry with the underlying space being the sphere S^n and the Lorentz group [22] of an (n+2)-dimensional space (e.g., R^{n+1,1}) acting as the transformation group.

We obtain the main inspiration for our work from the idea of modeling hyperspherical surfaces using a conformal space representation introduced in [23] and exploited in [16]. The hypersphere neuron with, as the name suggests, a hypersphere as the decision surface is proposed as a variant of a Clifford neuron in [3]. Therein, a hyperspherical surface is shown to be a generalization of a hyperplane. A multilayer feed-forward neural network based on hypersphere neurons, called the MLHP, is designed in [17]. The authors describe how a certain amount of reduction in computational complexity can be achieved when using the MLHP model for some types of learning tasks.
A multilayer Clifford neural network, as well as the corresponding back-propagation derivation, is first proposed and discussed in [8, 9]. As per [15], a work on another multilayer model, two key concepts for Clifford neural machinery originate and are easy to analyze at the neuron level: the ability to process various geometric entities and the concept of the geometric model. The latter acts as certain transformations on the processed data and becomes inherent by choosing a particular Clifford algebra. The paper [15] also introduces the spinor Clifford neurons (SCN) with weights acting like rotors from two sides. It is demonstrated how a single SCN can be used to compute Möbius transformations in a linear way: something unattainable by any real-valued network. Additionally, the paper describes Clifford-valued activation functions for all two-dimensional Clifford algebras. Less related to geometric problems are models of recurrent Clifford NNs, as originally discussed in [12], where their dynamics are studied from the perspective of the existence of energy functions.

In contrast to prior work, our MLGP model exploits a different embedding scheme that reflects the 3D-geometry. As a consequence, the first hidden layer units consist of (linear) combinations of hypersphere neurons and, along with the rest of the layers, do not necessarily require an activation function. Utilizing the chosen embedding strategy results in high performance in certain geometric tasks involving noisy data. We explain the details and discuss the advantages of our method in a later section.
The conformal model, which we will present in detail in Section 4, is based on a particular embedding of Euclidean space into a Minkowski vector space of higher dimensionality. The construction of this embedding is reminiscent of the way in which projective n-space is embedded in R^{n+1} by means of homogeneous coordinates. Some key aspects pertaining to the geometric interpretation will also be similar for these two embeddings. For this reason, we believe it is instructional to briefly discuss the—perhaps more familiar—projective case before discussing the conformal embedding.

Cartesian coordinates are commonly used to represent various geometric objects and concepts in Euclidean and affine spaces, both of which can be thought of as special subsets of projective spaces. By considering a projective space coordinatized by homogeneous coordinates, it turns out that the solutions to some geometric problems in Euclidean and affine geometry, as well as the resulting expressions, become simpler [18].

Homogeneous coordinates constitute an embedding of n-dimensional projective space P^n(R) into the (n+1)-dimensional Euclidean space R^{n+1}, such that every x ∈ P^n(R) corresponds to an equivalence class

[x̄] = {y ∈ R^{n+1} : y = γx̄, γ ∈ R \ {0}}.   (1)

Thus, every nonzero x̄ ∈ R^{n+1} relates to a uniquely defined x ∈ P^n(R), and we call x̄ (or any other element of [x̄]) homogeneous coordinates of this x.
Points. The canonical form of the homogeneous coordinates of a point x = (x_1, ..., x_n) is obtained as x̄ = (x_1, ..., x_n, 1), i.e., by appending an extra coordinate and setting it equal to one. This operation is clearly reversible, allowing x to be recovered from its canonical homogeneous coordinates, and thus allowing an interpretation of x̄ in the original geometric setting of x.

In the process of performing geometric computations, one may obtain a homogeneous coordinate vector y which is not in canonical form. In order to make a relevant geometric interpretation of such a result, it is, in general, necessary to find the corresponding canonical form. This is done through point normalization, which is achieved by dividing y by its final coordinate, resulting in

x̄ = y / y_{n+1} = (y_1/y_{n+1}, ..., y_n/y_{n+1}, 1) = (x_1, ..., x_n, 1).   (2)
Hyperplanes. A hyperplane in R^n with normal vector p = (p_1, ..., p_n) is represented by the equation p_1 x_1 + ... + p_n x_n = Δ for some Δ. By introducing p̄ = (p_1, ..., p_n, −Δ) as (dual) homogeneous coordinates of the hyperplane, and using homogeneous coordinates x̄ to represent x, the hyperplane equation becomes

(p_1, ..., p_n, −Δ) · (x_1, ..., x_n, 1) = 0  ⇐⇒  p̄^T x̄ = 0.   (3)

In particular, if Δ ≥ 0 and ‖p‖ = 1, then Δ will be precisely the Euclidean distance from the hyperplane to the origin, and p will be the outward pointing unit normal. A geometric interpretation of p̄ requires this particular canonical form, which is obtained through dual normalization (note that this works fundamentally differently from point normalization). If a hyperplane p̄ and a point z̄ are given by their respective canonical homogeneous representations, it can be shown that the shortest distance between them is given by |p̄^T z̄|. Note that this agrees perfectly with (3).

In the following section, the reader is encouraged to keep the following crucial observations in mind: (a) geometric objects may have more than one representation in a given embedding, (b) making relevant geometric interpretations of objects in the embedding requires proper normalization, (c) the relevant normalization can be fundamentally different depending on the particular type of object under consideration, and (d) scalar products in the embedding space can have particular significance.
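To make observations (b)–(d) concrete, here is a minimal NumPy sketch (our own illustration; the numeric values are not from the paper) that point-normalizes a homogeneous vector as in (2), dually normalizes a hyperplane, and evaluates the point–plane distance via the scalar product in (3):

```python
import numpy as np

# A non-canonical homogeneous point; point normalization (2) divides by the last coordinate.
y = np.array([2.0, 4.0, 6.0, 2.0])
x_bar = y / y[-1]                        # -> (1, 2, 3, 1), i.e., the point (1, 2, 3)

# Dual homogeneous coordinates (p, -Delta) of the plane z = 5 in R^3.
p, delta = np.array([0.0, 0.0, 1.0]), 5.0
p_bar = np.append(p, -delta)
p_bar /= np.linalg.norm(p_bar[:-1])      # dual normalization: unit-length normal part

print(p_bar @ x_bar)                     # -2.0; its absolute value is the distance to the plane
```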
Taking one step further from homogeneous coordinates opens up a whole new world in the form of conformal geometry [23], i.e., angle-preserving transformations on a space.

Minkowski space [22], named after H. Minkowski who introduced R^{3,1} as a model of spacetime, is the real vector space ME^n ≡ R^{n+1,1}, where the first n+1 basis vectors square to +1 and the last one to −1.

Minkowski R^{1,1} space. The relevance of Minkowski spaces for Euclidean geometry is well described in terms of the Minkowski R^{1,1} plane in [22]. Its orthonormal basis is defined as {e_+, e_−}, where e_+² = 1, e_−² = −1, and e_+ · e_− = 0. A null basis can then be constructed as the two vectors {e_0, e_∞}, where e_0 = ½(e_− − e_+) is the origin and e_∞ = e_− + e_+ is the point at infinity. Note the properties e_0² = e_∞² = 0 and e_0 · e_∞ = −1.
Conformal embedding. Given a vector in Euclidean space, x ∈ R^n, one can construct the conformal space as ME^n ≡ R^{n+1,1} = R^n ⊕ R^{1,1}. The embedding of x in the conformal space ME^n represents the stereographic projection of x onto a projection sphere defined in ME^n as

X = C(x) = x + ½x²e_∞ + e_0,   (4)

where X ∈ ME^n is called normalized and x² = x · x = ‖x‖². Observe that X² = 0. From (4), we obtain the naming of the two null vectors: e_0 = C(0) and e_∞ = lim_{‖x‖→∞} (2/x²) C(x).

The embedding (4) is homogeneous, i.e., all embedding vectors in the equivalence class

[X] = {Y ∈ R^{n+2} : Y = γX, γ ∈ R \ {0}}   (5)

are taken to represent the same vector x. This property is fundamental for the remainder of the paper.
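As a sanity check, the following NumPy sketch (our own illustration, not code from the paper) represents ME^n as R^{n+2} equipped with the Minkowski metric diag(1, ..., 1, −1), constructs the null basis {e_0, e_∞}, and verifies that an embedded point (4) is a null vector:

```python
import numpy as np

n = 3                                         # Euclidean dimension
eta = np.diag([1.0] * (n + 1) + [-1.0])       # Minkowski metric of R^{n+1,1}

def mdot(a, b):
    """Minkowski scalar product."""
    return a @ eta @ b

e_plus, e_minus = np.eye(n + 2)[n], np.eye(n + 2)[n + 1]   # e_+^2 = +1, e_-^2 = -1
e_0 = 0.5 * (e_minus - e_plus)                # origin
e_inf = e_minus + e_plus                      # point at infinity

def conformal_embed(x):
    """Embedding (4): X = x + 0.5 * ||x||^2 * e_inf + e_0."""
    return np.concatenate([x, np.zeros(2)]) + 0.5 * (x @ x) * e_inf + e_0

X = conformal_embed(np.array([1.0, 2.0, 3.0]))
print(mdot(e_0, e_0), mdot(e_inf, e_inf), mdot(e_0, e_inf))   # 0.0 0.0 -1.0
print(mdot(X, X))                                             # 0.0: X is a null vector
```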
Scalar product in conformal space. Given Y = y + ½y²e_∞ + e_0, the scalar product of two embeddings in conformal space encodes the Euclidean distance, which constitutes the basis for deriving the hypersphere neuron [3]: X · Y = −½(x − y)².

A normalized hypersphere in ME^n is a hypersphere S ∈ ME^n with center c = (c_1, ..., c_n) ∈ R^n embedded as C ∈ ME^n, radius r ∈ R, and the coefficient for e_0 set to 1. It is defined in the conformal space as

S = c + ½(c² − r²)e_∞ + e_0 = C − ½r²e_∞.

Keeping in mind that R^n has basis (e_1, ..., e_n), we obtain the scalar product of an embedded data vector X and a hypersphere S in ME^n: X · S = X · C − ½r² X · e_∞ = −½(x − c)² + ½r². That is, X · S = 0 ⇐⇒ |x − c| = |r|. Specifically, the scalar product shows where the input vector is relative to the hypersphere: inside (positive product), on (zero), or outside (negative product) the hypersphere.

It has been shown in [3] that by embedding a data vector x = (x_1, ..., x_n) ∈ R^n and the hypersphere S ∈ ME^n in R^{n+2} as

X = (x_1, ..., x_n, −1, −½x²) ∈ R^{n+2},   S = (c_1, ..., c_n, ½(c² − r²), 1) ∈ R^{n+2},   (6)

with S further referred to as a normalized hypersphere (in R^{n+2}), one can implement a hypersphere neuron in ME^n as a standard dot product in R^{n+2}, since

X · S = x · c − ½(c² − r²) − ½x² = −½(x − c)² + ½r²,   (7)

which coincides with the scalar product X · S in ME^n derived above.
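A minimal NumPy sketch of the hypersphere neuron as a plain dot product, following (6) and (7) (the center, radius, and test points are illustrative choices of ours):

```python
import numpy as np

def embed_point(x):
    """Data embedding from (6): X = (x_1, ..., x_n, -1, -0.5 * ||x||^2)."""
    return np.concatenate([x, [-1.0, -0.5 * (x @ x)]])

def embed_sphere(c, r):
    """Hypersphere embedding from (6): S = (c_1, ..., c_n, 0.5 * (||c||^2 - r^2), 1)."""
    return np.concatenate([c, [0.5 * (c @ c - r ** 2), 1.0]])

c, r = np.zeros(3), 2.0
S = embed_sphere(c, r)
for x in [np.array([1.0, 0.0, 0.0]),     # inside  -> positive product
          np.array([2.0, 0.0, 0.0]),     # on      -> zero
          np.array([3.0, 0.0, 0.0])]:    # outside -> negative product
    # (7): X . S = -0.5 * ||x - c||^2 + 0.5 * r^2
    print(embed_point(x) @ S, -0.5 * np.sum((x - c) ** 2) + 0.5 * r ** 2)
```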
Conformal model
From now on and depending on the context, hypersphere refers to either a decision surface (geometric entity), the (scaled) embedded vector S ∈ R^{n+2} (6), or a classifier (the hypersphere neuron).

In the MLHP [17], the model input is treated as a single real n-vector that is subsequently embedded in R^{n+2}, as discussed in Section 3.3. However, such an embedding scheme is not the ultimate choice for all types of learning problems. Consider a geometric problem where each input represents the 3D coordinates of k points, e.g., a real k × 3 array. That is, we will focus on the geometry of the Euclidean R^3 space and its conformal embedding in ME^3 ≡ R^{4,1}, implemented in R^5 (7).

To preserve the input structure, we propose to apply the conformal embedding to the input point-wise, in contrast to performing the embedding on a vectorized input as in [17]. Nevertheless, each intermediate layer output in our model is a one-dimensional array, z ∈ R^m, that we embed in R^{m+2}. In order to propagate the embedded input through the initial linear layer, we vectorize the k × 5 array row-wise into X ∈ R^{5k}. We illustrate the embeddings in Fig. 1. For a given (embedded) input X ∈ R^{5k}, a single unit, i.e., a geometric (conformal) neuron, S̃ ∈ R^{5k}, in the first layer represents k hyperspheres: one for each 3-vector in the k × 3 input. Note that the proposed method works for any dimension, not only three, and, given one single input vector, the geometric neuron is identical to the hypersphere neuron [3].

Notably, the embedding, which is non-linear and present at each layer, may eliminate the need for activation functions implied by MLPs. The choice of the final layer activation function depends solely on the application and, therefore, is no different from the standard MLP case.

Since we regard the model parameters as independent during training, as proposed by [3], our model learns non-normalized hyperspheres (parameter vectors) of the form S̃ = (s_1, s_2, ..., s_{n+2}) ∈ R^{n+2}. To obtain normalized vectors S as in (6), one performs point normalization (2), i.e., divides all elements in the learned parameter vector S̃ by the last one. We refer to this last element, s_{n+2}, as the scale factor, γ ∈ R. The scale factors can take arbitrary values.

Due to the homogeneity of the hypersphere representation (5), both normalized and non-normalized hyperspheres represent the same decision surface. As a result, we can alternatively describe the unit output as a weighted sum of the scalar products of the k embedded input vectors and the k normalized hyperspheres that it represents, i.e., a weighted sum of outputs of the normalized hyperspheres. In this case, the coefficients are given by the scale factors:

z_i = Σ_{j=1}^{k} γ_j X_j^T S_j,   (8)

where z_i ∈ R is the i-th element in the hidden layer output, X_j ∈ R^5 is the j-th embedded point of X ∈ R^{5k}, and S_j = S̃_j / γ_j are the corresponding normalized learned parameters.

Expressing the output signal in this manner allows us to analyze the learned decision surfaces in the respective Euclidean space. We discuss the important effect of having negative scale factors in a later section. Also, the radii of hyperspheres can be extracted from their normalized form (6), namely from the second-to-last element. However, as all other parameters, this element can be learned freely. It can even become negative, representing a hypersphere with an imaginary radius. Although lacking a geometric interpretation, this can be beneficial for the learning process [3].
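To make the data flow of Fig. 1 concrete, here is a simplified PyTorch sketch of the first layer (module and variable names are ours; this is an illustration of the construction, not the authors' implementation): each 3D point is embedded into R^5, the k × 5 array is flattened row-wise, and a bias-free linear layer follows, so every output unit is a geometric neuron, i.e., a linear combination of k hypersphere neurons as in (8).

```python
import torch
import torch.nn as nn

def conformal_embed(x):
    """Append the two embedding terms of (6) along the last dimension:
    (..., n) -> (..., n + 2) with entries (x, -1, -0.5 * ||x||^2)."""
    minus_one = -torch.ones(*x.shape[:-1], 1)
    minus_half_sq = -0.5 * (x * x).sum(dim=-1, keepdim=True)
    return torch.cat([x, minus_one, minus_half_sq], dim=-1)

class GeometricLayer(nn.Module):
    """Hypothetical first MLGP layer: point-wise embedding, row-wise
    vectorization, and a bias-free linear map (each row of the weight
    matrix concatenates k hypersphere parameter 5-vectors)."""
    def __init__(self, k_points, out_units):
        super().__init__()
        self.linear = nn.Linear(5 * k_points, out_units, bias=False)

    def forward(self, points):              # points: (batch, k, 3)
        X = conformal_embed(points)         # (batch, k, 5)
        return self.linear(X.flatten(1))    # (batch, out_units)

layer = GeometricLayer(k_points=4, out_units=4)
z = layer(torch.randn(8, 4, 3))             # the next layer would embed z in R^{m+2}
```

Point-normalizing each learned 5-element block of a unit (dividing by its last entry, the scale factor γ_j) recovers the normalized hypersphere form (6), from which a center and radius can be read off; this is the interpretation used in a later section.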
We test our method on a geometry classification problem and compare its performance with those of the analogous baseline MLHP and vanilla MLP.

5.1 3D shape classification data
Figure 2: The 3D Tetris data (classes: chiral_shape_1, chiral_shape_2, square, line, corner, L, T, zigzag).

We use the 3D Tetris dataset proposed in [20]. It consists of 8 shapes, displayed in Fig. 2. Each data sample is a 4 × 3 array containing the 3D coordinates of 4 points in a certain order. We refer to these 3D coordinates as canonical. Note that the dataset includes two chiral shapes that are reflections of each other.
Main dataset. For the first experiment, we augment the Tetris data by performing a uniform random rotation in [0, 2π) (about a random axis) and a translation in (−1, 1), i.e., a rigid body transformation, of the canonical shapes. This way, we form a training set consisting of 1000 shapes, a validation set containing 9000 samples, and a test set of size 90000.
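A sketch of this augmentation step (our own illustration: the helper, the RNG seed, and the example coordinates are assumptions, while the sampling ranges follow the description above):

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

def random_rigid_transform(points):
    """Rotate a (4, 3) shape by a uniform random angle about a random axis
    and translate it, as in the main-dataset augmentation."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                       # random rotation axis
    theta = rng.uniform(0.0, 2.0 * np.pi)              # rotation angle
    R = Rotation.from_rotvec(theta * axis).as_matrix()
    t = rng.uniform(-1.0, 1.0, size=3)                 # random translation
    return points @ R.T + t

# Illustrative canonical coordinates (not necessarily the dataset's exact ordering).
square = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
augmented = random_rigid_transform(square)
```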
Theta-split. In order to test the capability of generalization over rigid body transformations in our model, we create a theta-split dataset. The rotation angle, θ, used in the dataset construction differs for the training and validation/test sets: θ_train is drawn from the uniform distribution over the joint interval [0, π/2) ∪ [π, 3π/2), and θ_val and θ_test from the antipodal interval.
Data with noise. An important practical consideration for model comparison is that real-world data often contain a certain amount of noise. Therefore, we add distortion, n, of different levels to the shape coordinates in the main and theta-split datasets: n ∼ U(−a, a) with a ∈ {0.1, 0.2}. We thus obtain four additional datasets.

When building models for the experiments, we want the total number of parameters in them to be comparable. However, since the decision surfaces in the conformal (MLGP) and the baseline (MLHP) models are of a higher order of complexity compared to the vanilla MLP case, this results in a different number of hidden units. We select the vanilla model with 6 hidden units (134 parameters), the baseline MLHP model with 5 hidden units (126 parameters), and our MLGP with 4 hidden units (128 parameters). Note that the vanilla model includes bias parameters, whereas the other two do not. We try different activation functions for all models: the sigmoid, hyperbolic tangent (tanh), ReLU, and identity, i.e., no activation function. In the case of MLGP and MLHP, identity means that the only source of non-linearity is the embedding. The final layer of all models in our experiments is equipped with the softmax activation function. We implement all models in PyTorch [24] and use the default parameter initialization for linear layers. We train the models for 20000 epochs by minimizing the cross-entropy loss function with the Adam optimizer [25] supplied with the default hyperparameters: the learning rate is set to 0.001, β_1 = 0.9, and β_2 = 0.999. We run each experiment 50 times. At each run, we generate the datasets (training and validation sets) described in Section 5.1. The test data are generated once for each experiment.
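For completeness, a minimal sketch of this training configuration (the stand-in model follows the vanilla MLP's description—6 hidden units, ReLU, biases, 134 parameters—but its exact layer layout is inferred; the data tensors are placeholders for the generated Tetris sets):

```python
import torch
import torch.nn as nn

# Placeholder model and data with the sizes described in the text
# (4 x 3 inputs, 8 classes, 1000 training shapes).
model = nn.Sequential(nn.Flatten(), nn.Linear(12, 6), nn.ReLU(), nn.Linear(6, 8))
inputs, labels = torch.randn(1000, 4, 3), torch.randint(0, 8, (1000,))

optimizer = torch.optim.Adam(model.parameters())   # defaults: lr=1e-3, betas=(0.9, 0.999)
criterion = nn.CrossEntropyLoss()                   # softmax is applied inside the loss

for epoch in range(20000):                          # 20000 epochs, as described above
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
```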
We train and test the models on the main and theta-split datasets, both with different levels of noise. Since the three vanilla models with the identity, sigmoid, and tanh activation functions, respectively, are inferior to the one with ReLU, we show only the latter case and proceed with ReLU as the activation function for the vanilla model. Our MLGP method and the baseline model perform much better without activation functions, which motivates us to use this configuration. The results are reflected in Fig. 3. The performances of the models on the test data in all experiments are presented in Table 1.

[Figure 3 panels: Main dataset, Theta-split, Noise a = 0.1, Noise a = 0.2; each panel shows training and validation accuracy over iterations for MLGP (4 hidden units), the baseline (5 hidden units), and the vanilla ReLU MLP (6 hidden units).]
Figure 3: Top: model accuracies on the main and theta-split data. Bottom: model accuracies on the theta-split data with different noise levels. Shown are the mean and standard deviation over 50 runs on the training and validation sets.

Table 1: Model accuracies on the test data (mean and standard deviation over 50 runs, %); values in parentheses represent the accuracy of the 10 best models selected based on the validation accuracy.
[Table 1 rows: Vanilla MLP, Baseline [17], MLGP (ours); columns: Main dataset and Theta-split, each at noise levels a = 0, a = 0.1, and a = 0.2.]
The superiority of our conformal MLGP model, with no activation function other than the embedding and even fewer parameters (128 vs. 134), over the plain MLP is evident from Fig. 3 and Table 1. However, this advantage comes at a price of increased computational complexity. Considering that in the embedding step we have to evaluate the magnitude of m vectors at each layer, we can roughly compare it to adding m extra neurons to the respective layer in an analogous vanilla MLP, in accordance with the complexity analysis given in [3].

In all noisy data experiments, albeit with higher variance, our MLGP demonstrates, on average, better generalization than the baseline and vanilla models (see Table 1). The noted variations of the validation accuracy are presumably due to confusing two or three classes. We, therefore, select ten models of each type with the highest validation accuracy. Our method has the best correlation of validation accuracy and test accuracy, which further increases its advantage and reduces its variance.

We notice a significant drop in the generalization performance of the vanilla model in the case of the theta-split data, whereas the generalization accuracy of our method decreases insignificantly, as indicated by Table 1 and the corresponding plots in Fig. 3. Thus, we empirically show that our MLGP model has the capability of generalization over rigid body transformations. To a slightly lesser extent, the same applies to the baseline model, which has, however, not been discussed in the original work [17]. Overall, the experiments show that our method may not only be a more geometric, but also a better generalizing approach for certain learning problems.

Figure 4: A single hidden unit in our MLGP model classifying the Tetris shapes (shown: the square and corner shapes with their decision spheres): each spherical decision surface classifies the corresponding point of the input signal; the unit output is then a linear combination of the scalar products (8); the arrows specify the positive direction of the scalar product, i.e., inside or outside the sphere.

One of the main advantages of the embedding scheme utilized in our conformal model is that it provides a topological interpretation of the learned coefficients. To visualize the learned decision surfaces in the Euclidean R^3 space, we need to point-normalize them according to (6). We use two shapes from the main dataset described in Section 5.1 as input to the trained MLGP model. We demonstrate the input shapes and the four spherical decision surfaces represented by the third hidden unit in Fig. 4.

Note that each sphere may have a different scale factor. It can be negative and, thanks to the normalization step, turn the decision surface inside out for a given input. Importantly, this swap is itself a conformal transformation. If the sign of the scalar product of an input vector X and a hypersphere S is the same regardless of the normalization of the latter, the input is categorized as class I if it is inside or on the surface of the hypersphere, and as class O if it is outside. We refer to such a hyperspherical classifier as an I-hypersphere. Otherwise, the hypersphere is referred to as an O-hypersphere. We illustrate the idea of the inverted decision surfaces by drawing O-spheres in red, whereas I-spheres are shown in blue (see Fig. 4).
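A small sketch of this interpretation step (the learned vector below is made up for illustration, and the helper name is ours): point-normalizing a learned 5-element block recovers the scale factor γ, the sphere center, and the squared radius from the normalized form (6).

```python
import numpy as np

def interpret_sphere(s_tilde):
    """Split a learned 5-vector (n = 3) into scale factor, center, and squared
    radius using the normalized form (6): S = (c, 0.5 * (||c||^2 - r^2), 1)."""
    gamma = s_tilde[-1]                    # scale factor
    S = s_tilde / gamma                    # point normalization (2)
    c = S[:3]                              # sphere center
    r_squared = c @ c - 2.0 * S[3]         # may be negative: imaginary radius
    return gamma, c, r_squared

s_tilde = np.array([0.4, -0.2, 0.6, 0.1, -0.5])    # illustrative learned parameters
gamma, c, r_sq = interpret_sphere(s_tilde)
# gamma < 0 here: the normalization turns the decision surface inside out,
# i.e., this block acts as an O-hypersphere in the terms used above.
```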
In this work, we propose the multilayer geometric perceptron (MLGP) whose units are geometric (conformal) neurons based on hypersphere neurons. We show that the proposed conformal model outperforms a standard MLP in classifying geometric shapes while having even fewer parameters and no activation function other than the embedding. Furthermore, we empirically demonstrate that our method has the capability of generalization over rigid body transformations. In the presence of noise in the data, our model is, on average, superior to the baseline multilayer hypersphere perceptron (MLHP). Owing to the homogeneous representation of the hyperspheres and the chosen embedding scheme, we provide a topological interpretation of the learned model parameters and hypersphere scale factors in the Euclidean space. In future work, we plan to further investigate the conformal MLGP model. In particular, we intend to analyze the local minima causing the confusion of classes in a small number of models and how to avoid them.

Broader Impact
In a longer perspective, many learning problems involving geometry will benefit from the findings in this paper: the embedding will greatly contribute to generalization and adherence to constraints. This work might supersede prior work on similar problems that, e.g., attempts to achieve invariance and generalization by data augmentation.

With our method, whenever training fails to find a correct model, the overall performance is significantly reduced. Currently, this is addressed by training several models and disregarding those with a significant drop in validation performance. We assume that local minima are responsible for this effect. We are not aware of any biases in the data that our method leverages other than a confusion of classes, which is strongly correlated with the similarity of point configurations.

As the nature of the present paper is purely theoretical, no direct ethical issues arise, nor are there any immediate societal consequences.
Acknowledgments and Disclosure of Funding
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP), by the Swedish Research Council through a grant for the project Algebraically Constrained Convolutional Networks for Sparse Image Data (2018-04673), and by the Centre for Industrial Information Technology (CENIIT).
References

[1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[2] S. Buchholz and G. Sommer, "A hyperbolic multilayer perceptron," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000): Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 2, pp. 129–133, IEEE, 2000.
[3] V. Banarer, C. Perwass, and G. Sommer, "The hypersphere neuron," in ESANN, pp. 469–474, 2003.
[4] H. Lipson and H. T. Siegelmann, "Clustering irregular shapes using high-order neurons," Neural Computation, vol. 12, no. 10, pp. 2331–2353, 2000.
[5] R. W. Sharpe, Differential Geometry: Cartan's Generalization of Klein's Erlangen Program, vol. 166. Springer Science & Business Media, 2000.
[6] G. Sommer, Geometric Computing with Clifford Algebras: Theoretical Foundations and Applications in Computer Vision and Robotics. Springer Science & Business Media, 2013.
[7] E. Hitzer, T. Nitta, and Y. Kuroe, "Applications of Clifford's geometric algebra," Advances in Applied Clifford Algebras, vol. 23, no. 2, pp. 377–404, 2013.
[8] J. Pearson and D. Bisset, "Back propagation in a Clifford algebra," in Proc. Int. Conf. Artificial Neural Networks (I. Aleksander and J. Taylor, eds.), vol. 2, pp. 413–416, 1992.
[9] J. Pearson and D. Bisset, "Neural networks in the Clifford domain," in Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), vol. 3, pp. 1465–1469, IEEE, 1994.
[10] E. J. Bayro-Corrochano, "Geometric neural computing," IEEE Transactions on Neural Networks, vol. 12, no. 5, pp. 968–986, 2001.
[11] J. R. Vallejo and E. Bayro-Corrochano, "Clifford Hopfield neural networks," in IEEE International Joint Conference on Neural Networks (IJCNN), pp. 3609–3612, IEEE, 2008.
[12] Y. Kuroe, "Models of Clifford recurrent neural networks and their dynamics," in The 2011 International Joint Conference on Neural Networks, pp. 1035–1041, IEEE, 2011.
[13] M. Tygert, J. Bruna, S. Chintala, Y. LeCun, S. Piantino, and A. Szlam, "A mathematical motivation for complex-valued convolutional networks," Neural Computation, vol. 28, no. 5, pp. 815–825, 2016.
[14] O. Moran, P. Caramazza, D. Faccio, and R. Murray-Smith, "Deep, complex, invertible networks for inversion of transmission effects in multimode optical fibres," in Advances in Neural Information Processing Systems, pp. 3280–3291, 2018.
[15] S. Buchholz and G. Sommer, "On Clifford neurons and Clifford multi-layer perceptrons," Neural Networks, vol. 21, no. 7, pp. 925–935, 2008.
[16] C. Perwass, V. Banarer, and G. Sommer, "Spherical decision surfaces using conformal modelling," in Joint Pattern Recognition Symposium, pp. 9–16, Springer, 2003.
[17] V. Banarer, C. Perwass, and G. Sommer, "Design of a multilayered feed-forward neural network using hypersphere neurons," in International Conference on Computer Analysis of Images and Patterns, pp. 571–578, Springer, 2003.
[18] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[19] M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. S. Cohen, "3D steerable CNNs: Learning rotationally equivariant features in volumetric data," in Advances in Neural Information Processing Systems, pp. 10381–10392, 2018.
[20] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley, "Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds," arXiv preprint arXiv:1802.08219, 2018.
[21] B. Anderson, T. S. Hy, and R. Kondor, "Cormorant: Covariant molecular neural networks," in Advances in Neural Information Processing Systems, pp. 14510–14519, 2019.
[22] H. Li, D. Hestenes, and A. Rockwood, "Generalized homogeneous coordinates for computational geometry," in Geometric Computing with Clifford Algebras, pp. 27–59, Springer, 2001.
[23] H. Li, D. Hestenes, and A. Rockwood, "A universal model for conformal geometries of Euclidean, spherical and double-hyperbolic spaces," in Geometric Computing with Clifford Algebras, pp. 77–104, Springer, 2001.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.