CapProNet: Deep Feature Learning via Orthogonal Projections onto Capsule Subspaces
Liheng Zhang†, Marzieh Edraki†, and Guo-Jun Qi†‡∗
† Laboratory for MAchine Perception and LEarning (MAPLE), University of Central Florida, http://maple.cs.ucf.edu
‡ Huawei Cloud, Seattle, USA
∗ Corresponding author: G.-J. Qi, email: [email protected] and [email protected]
Abstract
In this paper, we formalize the idea behind capsule nets of using a capsule vector rather than a neuron activation to predict the label of samples. To this end, we propose to learn a group of capsule subspaces onto which an input feature vector is projected. The lengths of the resultant capsules are then used to score the probability of belonging to different classes. We train such a Capsule Projection Network (CapProNet) by learning an orthogonal projection matrix for each capsule subspace, and show that each capsule subspace is updated until it contains the input feature vectors corresponding to the associated class. We also show that the capsule projection can be viewed as normalizing the multiple columns of a weight matrix simultaneously to form an orthogonal basis, which makes it more effective in incorporating novel components of input features to update capsule representations. In other words, the capsule projection can be viewed as a multi-dimensional weight normalization in capsule subspaces, where the conventional weight normalization is simply a special case of the capsule projection onto 1-D lines. Only a small, negligible computing overhead is incurred to train the network in low-dimensional capsule subspaces or through an alternative hyper-power iteration to estimate the normalization matrix. Experiment results on image datasets show that the presented model can greatly improve the performance of state-of-the-art ResNet and DenseNet backbones at the same level of computing and memory expenses. The CapProNet establishes competitive state-of-the-art performance for the family of capsule nets by significantly reducing test errors on the benchmark datasets.

1 Introduction

Since the idea of the capsule net [15, 9] was proposed, many efforts [8, 17, 14, 1] have been made to seek better capsule architectures as the next generation of deep network structures. Among them is the dynamic routing [15], which can dynamically connect the neurons between two consecutive layers based on their output capsule vectors. While these efforts have greatly revolutionized the idea of building a new generation of deep networks, there is still large room to improve the state of the art for capsule nets.

In this paper, we do not intend to introduce a brand new architecture for capsule nets. Instead, we focus on formalizing the principled idea of using the overall length of a capsule rather than a single neuron activation to model the presence of an entity [15, 9]. Unlike the existing idea in the literature [15, 9], we formulate this idea by learning a group of capsule subspaces to represent a set of entity classes. Once the capsule subspaces are learned, we can obtain a set of capsules by performing an orthogonal projection of feature vectors onto these capsule subspaces.

Then, one can adopt the principle of separating the presence of an entity and its instantiation parameters into the capsule length and orientation, respectively. In particular, we use the lengths of capsules to score the presence of entity classes corresponding to different subspaces, while their orientations are used to instantiate the parameters of entity properties such as poses, scales, deformations and textures.
In this way, one can use the capsule length to achieve intra-class invariance in detecting the presence of an entity against appearance variations, as well as model the equivalence of the instantiation parameters of entities by encoding them into capsule orientations [15].

Formally, each capsule subspace is spanned by a basis from the columns of a weight matrix in the neural network. A capsule projection is performed by projecting input feature vectors fed from a backbone network onto the capsule subspace. Specifically, an input feature vector is orthogonally decomposed into the capsule component, i.e., the projection onto a capsule subspace, and the complement component perpendicular to the subspace. By analyzing the gradient through the capsule projection, one can show that a capsule subspace is iteratively updated along the complement component that contains the novel characteristics of the input feature vector. The training process continues until all presented feature vectors of an associated class are well contained by the corresponding capsule subspace, or simply until the back-propagated error accounting for misclassification caused by capsule lengths vanishes.

We call the proposed deep network with capsule projections the CapProNet for brevity. The CapProNet is friendly to any existing network architecture: it is built upon the embedded features generated by a backbone network and outputs the projected capsule vectors in the subspaces according to different classes. This makes it amenable to be used together with existing network architectures. We conduct experiments on image datasets to demonstrate that the CapProNet can greatly improve state-of-the-art results with only a small, negligible computing overhead.
Briefly, we summarize our main findings from experiments upfront about the proposed CapProNet.

• The proposed CapProNet significantly advances the capsule net performance [15] by reducing its test errors on CIFAR10 and SVHN upon the chosen backbones.

• The proposed CapProNet can also greatly reduce the error rates of various backbone networks by adding capsule projection layers into these networks, for both ResNet and DenseNet backbones, with only a negligible computing overhead and about 0.04% memory overhead in training the model compared with the backbones.

• The orthogonal projection onto capsule subspaces plays a critical role in delivering competitive performance. On the contrary, simply grouping neurons into capsules does not obviously improve the performance. This shows that the capsule projection plays an indispensable role in the CapProNet delivering competitive results.

• Our insight into the gradient of the capsule projection in Section 2.3 explains the advantage of updating capsule subspaces to continuously contain novel components of training examples until they are correctly classified. We also find that the capsule projection can be viewed as a high-dimensional extension of weight normalization in Section 2.4, where the conventional weight normalization is merely a special case of the capsule projection onto 1-D lines.

The source code is available at https://github.com/maple-research-lab.

The remainder of this paper is organized as follows. We present the idea of the Capsule Projection Net (CapProNet) in Section 2, and discuss the implementation details in Section 3. The review of related work follows in Section 4, and the experiment results are demonstrated in Section 5. Finally, we conclude the paper and discuss future work in Section 6.
2 The Capsule Projection Nets
In this section, we begin by briefly revisiting the idea of conventional neural networks for classification tasks. Then we formally present the orthogonal projection of input feature vectors onto multiple capsule subspaces, where capsule lengths are separated from their orientations to score the presence of entities belonging to different classes. Finally, we analyze the gradient of the resultant capsule projection, showing how capsule subspaces are updated iteratively to adopt novel characteristics of input feature vectors through back-propagation.
Consider a feature vector $\mathbf{x} \in \mathbb{R}^d$ generated by a deep network to represent an input entity. Given its ground truth label $y \in \{1, 2, \cdots, L\}$, the output layer of the deep network aims to learn a group of weight vectors $\{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_L\}$ such that

$$\mathbf{w}_y^T \mathbf{x} > \mathbf{w}_l^T \mathbf{x}, \quad \text{for all } l \neq y. \quad (1)$$

This hard constraint is usually relaxed to a differentiable softmax objective, and the back-propagation algorithm is performed to train $\{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_L\}$ and the backbone network generating the input feature vector $\mathbf{x}$.

Unlike simply grouping neurons to form capsules for classification, we propose to learn a group of capsule subspaces $\{\mathcal{S}_1, \mathcal{S}_2, \cdots, \mathcal{S}_L\}$, each associated with one of the $L$ classes. Suppose we have a feature vector $\mathbf{x} \in \mathbb{R}^d$ generated by a backbone network from an input sample. Then, to learn a proper feature representation, we project $\mathbf{x}$ onto these capsule subspaces, yielding $L$ capsules $\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_L\}$ as projections. We then use the lengths of these capsules to score the probability of the input sample belonging to different classes, assigning it to the class with the longest capsule.

Formally, for each capsule subspace $\mathcal{S}_l$ of dimension $c$, we learn a weight matrix $\mathbf{W}_l \in \mathbb{R}^{d \times c}$ whose columns form the basis of the subspace, i.e., $\mathcal{S}_l = \mathrm{span}(\mathbf{W}_l)$ is spanned by the column vectors. Then the orthogonal projection $\mathbf{v}_l$ of a vector $\mathbf{x}$ onto $\mathcal{S}_l$ is found by solving $\mathbf{v}_l = \arg\min_{\mathbf{v} \in \mathrm{span}(\mathbf{W}_l)} \|\mathbf{x} - \mathbf{v}\|$. This orthogonal projection problem has the closed-form solution

$$\mathbf{v}_l = \mathbf{P}_l \mathbf{x}, \quad \text{with } \mathbf{P}_l = \mathbf{W}_l \mathbf{W}_l^+,$$

where $\mathbf{P}_l$ is called the projection matrix for capsule subspace $\mathcal{S}_l$, and $\mathbf{W}_l^+$ is the Moore-Penrose pseudoinverse [4]. (A projection matrix $\mathbf{P}$ for a subspace $\mathcal{S}$ is a symmetric idempotent matrix, i.e., $\mathbf{P}^T = \mathbf{P}$ and $\mathbf{P}^2 = \mathbf{P}$, whose range space is $\mathcal{S}$.) When the columns of $\mathbf{W}_l$ are independent, $\mathbf{W}_l^+$ becomes $(\mathbf{W}_l^T \mathbf{W}_l)^{-1} \mathbf{W}_l^T$. In this case, since we only need the capsule length $\|\mathbf{v}_l\|$ to predict the class of an entity, we have

$$\|\mathbf{v}_l\| = \sqrt{\mathbf{v}_l^T \mathbf{v}_l} = \sqrt{\mathbf{x}^T \mathbf{P}_l^T \mathbf{P}_l \mathbf{x}} = \sqrt{\mathbf{x}^T \mathbf{W}_l \mathbf{\Sigma}_l \mathbf{W}_l^T \mathbf{x}} \quad (2)$$

where $\mathbf{\Sigma}_l = (\mathbf{W}_l^T \mathbf{W}_l)^{-1}$ can be seen as a normalization matrix applied to the transformed feature vector $\mathbf{W}_l^T \mathbf{x}$, normalizing the $\mathbf{W}_l$-transformation through the capsule projection. As we will discuss in the next subsection, this normalization plays a critical role in updating $\mathbf{W}_l$ along the orthogonal direction of the subspace, so that novel components pertaining to the properties of input entities can be gradually incorporated into the subspace.

In practice, since $c \ll d$, the $c$ columns of $\mathbf{W}_l$ are usually independent in the high-dimensional $d$-D space. Otherwise, one can always add a small $\epsilon \mathbf{I}$ to $\mathbf{W}_l^T \mathbf{W}_l$ to avoid numeric singularity when taking the matrix inverse. Later on, we will discuss a fast iterative algorithm to compute the matrix inverse with a hyper-power sequence that can be seamlessly integrated with the back-propagation iterations.
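To make the closed-form projection concrete, the following minimal NumPy sketch computes the capsule $\mathbf{v}_l$, its orthogonal complement, and the capsule length of Eq. (2) for a random basis and feature vector. The variable names mirror the notation above; the dimensions and data are illustrative only.

```python
# Minimal NumPy sketch of the capsule projection and the length in Eq. (2).
import numpy as np

d, c = 64, 4                       # input feature dimension and capsule dimension (c << d)
rng = np.random.default_rng(0)

W_l = rng.standard_normal((d, c))  # basis of the capsule subspace S_l = span(W_l)
x = rng.standard_normal(d)         # feature vector from the backbone network

sigma_l = np.linalg.inv(W_l.T @ W_l)   # normalization matrix (W_l^T W_l)^{-1}
P_l = W_l @ sigma_l @ W_l.T            # projection matrix P_l = W_l W_l^+

v_l = P_l @ x                          # capsule: orthogonal projection of x onto S_l
x_perp = x - v_l                       # orthogonal complement (I - P_l) x

# Capsule length used to score class membership, Eq. (2).
length = np.sqrt(x @ W_l @ sigma_l @ W_l.T @ x)

assert np.isclose(length, np.linalg.norm(v_l))    # both expressions agree
assert np.allclose(W_l.T @ x_perp, 0, atol=1e-8)  # complement is orthogonal to S_l
```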
2.3 Insight into Gradients

In this subsection, we take a look at the gradient used to update $\mathbf{W}_l$ in each iteration, which gives some insight into how the CapProNet works in learning the capsule subspaces. Suppose we minimize a loss function $\ell$ to train the capsule projection and the network; for simplicity, we only consider a single sample $\mathbf{x}$ and its capsule $\mathbf{v}_l$. Then, by the chain rule and the differential of the inverse matrix [13], we have the following gradient of $\ell$ w.r.t. $\mathbf{W}_l$:

$$\frac{\partial \ell}{\partial \mathbf{W}_l} = \frac{\partial \ell}{\partial \|\mathbf{v}_l\|} \frac{\partial \|\mathbf{v}_l\|}{\partial \mathbf{W}_l} = \frac{\partial \ell}{\partial \|\mathbf{v}_l\|} \frac{(\mathbf{I} - \mathbf{P}_l)\,\mathbf{x}\mathbf{x}^T \mathbf{W}_l^{+T}}{\|\mathbf{v}_l\|} \quad (3)$$

where the operator $(\mathbf{I} - \mathbf{P}_l)$ can be viewed as the projection onto the orthogonal complement of the capsule subspace spanned by the columns of $\mathbf{W}_l$, $\mathbf{W}_l^{+T}$ denotes the transpose of $\mathbf{W}_l^+$, and the factor $\frac{\partial \ell}{\partial \|\mathbf{v}_l\|}$ is the back-propagated error accounting for misclassification caused by $\|\mathbf{v}_l\|$.

Denote by $\mathbf{x}^\perp \triangleq (\mathbf{I} - \mathbf{P}_l)\mathbf{x}$ the projection of $\mathbf{x}$ onto the orthogonal complement perpendicular to the current capsule subspace $\mathcal{S}_l$. The above gradient $\frac{\partial \ell}{\partial \mathbf{W}_l}$ only contains columns parallel to $\mathbf{x}^\perp$ (up to coefficients in the vector $\mathbf{x}^T \mathbf{W}_l^{+T}$). This shows that the basis of the current capsule subspace $\mathcal{S}_l$ in the columns of $\mathbf{W}_l$ is updated along this orthogonal component of the input $\mathbf{x}$ to the subspace. Since one can regard $\mathbf{x}^\perp$ as representing the novel component of $\mathbf{x}$ not yet contained in the current $\mathcal{S}_l$, the capsule subspaces are updated to contain the novel component of each input feature vector until all training feature vectors are well contained in these subspaces, or until the back-propagated errors accounting for misclassification caused by $\|\mathbf{v}_l\|$ vanish.

Figure 1: This figure illustrates a 2-D capsule subspace $\mathcal{S}$ spanned by two basis vectors $\mathbf{w}_1$ and $\mathbf{w}_2$. An input feature vector $\mathbf{x}$ is decomposed into the capsule projection $\mathbf{v}$ onto $\mathcal{S}$ and an orthogonal complement $\mathbf{x}^\perp$ perpendicular to the subspace. In one training iteration, the two basis vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ are updated to $\mathbf{w}_1'$ and $\mathbf{w}_2'$ along the orthogonal direction $\mathbf{x}^\perp$, where $\mathbf{x}^\perp$ is viewed as containing novel characteristics of an entity not yet contained by $\mathcal{S}$.

Figure 1 illustrates an example of a 2-D capsule subspace $\mathcal{S}$ spanned by two basis vectors $\mathbf{w}_1$ and $\mathbf{w}_2$. An input feature vector $\mathbf{x}$ is decomposed into the capsule projection $\mathbf{v}$ onto $\mathcal{S}$ and an orthogonal complement $\mathbf{x}^\perp$ perpendicular to the subspace. In one training iteration, the two basis vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ are updated to $\mathbf{w}_1'$ and $\mathbf{w}_2'$ along the orthogonal direction $\mathbf{x}^\perp$, which is viewed as containing novel characteristics of an entity not yet contained by $\mathcal{S}$. As discussed above, the orthogonal component represents the novel information in the input data, and the orthogonal decomposition thus enables us to update capsule subspaces by incorporating novel characteristics/components more effectively than the classic capsule nets.

One can also view the capsule projection as normalizing the column basis of the weight matrix $\mathbf{W}_l$ simultaneously in a high-dimensional capsule space. If the capsule dimension $c$ is set to 1, it is not hard to see that Eq. (2) becomes $\|\mathbf{v}_l\| = \frac{|\mathbf{W}_l^T \mathbf{x}|}{\|\mathbf{W}_l\|}$, which is analogous to the conventional weight normalization of the vector $\mathbf{W}_l \in \mathbb{R}^d$. Thus, weight normalization is a special case of the proposed CapProNet in the 1-D case. By increasing the dimension $c$, the capsule projection can be viewed as normalizing the multiple columns of the weight matrix $\mathbf{W}_l$ to form an orthogonal basis, and, as explained above, this can more effectively incorporate novel components of input feature vectors to update capsule representations.
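The following short PyTorch sketch numerically checks the two observations above: the gradient of the capsule length with respect to $\mathbf{W}_l$ has no component inside the current subspace (Eq. (3)), and for $c = 1$ the capsule length reduces to the weight-normalization form. It is an illustrative check on random data, not part of the released implementation.

```python
# Sketch (PyTorch) checking (i) Eq. (3): the gradient of ||v_l|| w.r.t. W_l lies
# along x_perp = (I - P_l)x, and (ii) the 1-D weight-normalization special case.
import torch

torch.manual_seed(0)
d, c = 64, 4
x = torch.randn(d, dtype=torch.float64)

def capsule_length(W, x):
    sigma = torch.linalg.inv(W.T @ W)           # (W^T W)^{-1}
    return torch.sqrt(x @ W @ sigma @ W.T @ x)  # ||v_l||, Eq. (2)

# (i) Gradient direction: P_l applied to d||v_l||/dW_l should vanish.
W = torch.randn(d, c, dtype=torch.float64, requires_grad=True)
capsule_length(W, x).backward()
P = (W @ torch.linalg.inv(W.T @ W) @ W.T).detach()
print(torch.norm(P @ W.grad))                   # ~0: the update is orthogonal to span(W_l)

# (ii) 1-D special case: the capsule projection degenerates to weight normalization.
w = torch.randn(d, 1, dtype=torch.float64)
print(capsule_length(w, x), torch.abs(w.squeeze() @ x) / torch.norm(w))  # equal values
```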
3 Implementation Details

We discuss some implementation details in this section, including 1) the computing cost of performing the capsule projection and a fast iterative method using hyper-power sequences without restart, and 2) the objective used to train the capsule projection.
Taking a matrix inverse to obtain the normalization matrix $\mathbf{\Sigma}_l$ would become expensive with an increasing dimension $c$. However, after the model is trained, $\mathbf{\Sigma}_l$ is fixed in inference with only one-time computing. Fortunately, the dimension $c$ of a capsule subspace is usually much smaller than the feature dimension $d$, which is usually hundreds or even thousands; for example, $c$ is no larger than 8 in our experiments. Thus, taking a matrix inverse to compute these normalization matrices only incurs a small, negligible computing overhead compared with training the many other layers in a deep network.

Alternatively, one can take advantage of an iterative algorithm to compute the normalization matrix. We consider the following hyper-power sequence

$$\mathbf{\Sigma}_l \leftarrow 2\mathbf{\Sigma}_l - \mathbf{\Sigma}_l \mathbf{W}_l^T \mathbf{W}_l \mathbf{\Sigma}_l,$$

which has been proven to converge to $(\mathbf{W}^T \mathbf{W})^{-1}$ from a proper initial point [2, 3]. In a stochastic gradient method, since only a small change is made to $\mathbf{W}_l$ in each training iteration, it is often sufficient to use this recursion to make a one-step update on the normalization matrix from the last iteration. The normalization matrix $\mathbf{\Sigma}_l$ can be initialized to $(\mathbf{W}_l^T \mathbf{W}_l)^{-1}$ at the very first iteration to give an ideal start. This can further save computing cost in training the network.

In experiments, only a very small computing overhead was incurred by the capsule projection. For example, training the CapProNet with a ResNet110 backbone on CIFAR10/100 in an end-to-end fashion cost only a small additional time per iteration compared with training the backbone alone. For inference, we did not find any noticeable computing overhead of the CapProNet compared with its backbone network.

Given a group of capsule vectors $\{\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_L\}$ corresponding to a feature vector $\mathbf{x}$ and its ground truth label $y$, we train the model by requiring

$$\|\mathbf{v}_y\| > \|\mathbf{v}_l\|, \quad \text{for all } l \neq y.$$

In other words, we require $\|\mathbf{v}_y\|$ to be larger than the lengths of all the other capsules. As a consequence, we minimize the following negative logarithmic softmax function

$$\ell(\mathbf{x}, y) = -\log \frac{\exp(\|\mathbf{v}_y\|)}{\sum_{l=1}^{L} \exp(\|\mathbf{v}_l\|)}$$

to train the capsule subspaces and the network generating $\mathbf{x}$ through back-propagation in an end-to-end fashion. Once the model is trained, we classify a test sample into the class with the longest capsule.
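Putting the pieces of this section together, the sketch below implements a capsule-projection output layer and the negative log-softmax loss over capsule lengths in PyTorch. It uses the closed-form $\mathbf{\Sigma}_l$ with a small $\epsilon\mathbf{I}$ regularizer; the hyper-power recursion above could be substituted for the explicit inverse. The class and argument names are ours and do not follow the authors' released code.

```python
# Minimal PyTorch sketch of a capsule-projection output layer and its loss.
# Class and argument names are illustrative, not the authors' released API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CapsuleProjection(nn.Module):
    """Projects a d-dim feature onto L capsule subspaces of dimension c and
    returns the capsule lengths, which act as class scores."""
    def __init__(self, in_dim, num_classes, capsule_dim=4, eps=1e-4):
        super().__init__()
        # One d x c basis matrix W_l per class, stacked as (L, d, c).
        self.W = nn.Parameter(0.01 * torch.randn(num_classes, in_dim, capsule_dim))
        self.eps = eps

    def forward(self, x):                                 # x: (batch, d)
        # Sigma_l = (W_l^T W_l + eps I)^{-1}, closed form; the hyper-power recursion
        # Sigma <- 2 Sigma - Sigma W^T W Sigma could maintain it between iterations.
        WtW = torch.einsum('ldc,lde->lce', self.W, self.W)
        WtW = WtW + self.eps * torch.eye(WtW.size(-1), device=x.device)
        sigma = torch.linalg.inv(WtW)                     # (L, c, c)
        Wtx = torch.einsum('ldc,bd->blc', self.W, x)      # (batch, L, c)
        # ||v_l||^2 = x^T W_l Sigma_l W_l^T x, Eq. (2)
        sq_len = torch.einsum('blc,lce,ble->bl', Wtx, sigma, Wtx)
        return torch.sqrt(sq_len.clamp_min(1e-12))        # capsule lengths (batch, L)

# Training objective: negative log-softmax over capsule lengths.
layer = CapsuleProjection(in_dim=64, num_classes=10, capsule_dim=4)
features = torch.randn(8, 64)                             # stand-in for backbone features
labels = torch.randint(0, 10, (8,))
lengths = layer(features)
loss = F.cross_entropy(lengths, labels)                   # = -log softmax(||v_y||)
loss.backward()
```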
4 Related Work

The presented CapProNets are inspired by the CapsuleNets, adopting the idea of using a capsule vector rather than a neural activation output to predict the presence of an entity and its properties [15, 9]. In particular, the overall length of a capsule vector is used to represent the existence of the entity, and its direction instantiates the properties of the entity. We formalize this idea in this paper by explicitly learning a group of capsule subspaces and projecting embedded features onto these subspaces. The advantage of these capsule subspaces is that their directions can represent characteristics of an entity, which contain much richer information, such as positions, orientations, scales and textures, than a single activation output. By performing an orthogonal projection of an input feature vector onto a capsule subspace, one can find the best direction revealing these properties. Otherwise, the entity is considered absent, since the projection vanishes when the input feature vector is nearly perpendicular to the capsule subspace.
5 Experiments
We conduct experiments on benchmark datasets to evaluate the proposed CapProNet in comparison with other deep network models.
We use the CIFAR, SVHN and ImageNet datasets in experiments to evaluate the performance.
CIFAR
The CIFAR dataset contains 50,000 training and 10,000 test images of 32 × 32 pixels. A standard data augmentation is adopted with horizontal flipping and shifting. The images are labeled with 10 and 100 categories, forming the CIFAR10 and CIFAR100 datasets. A separate validation set split from the training images is used to choose the model hyperparameters, and the final test errors are reported with the chosen hyperparameters by training the model on all 50,000 training images.

SVHN
The Street View House Numbers (SVHN) dataset has 73,257 and 26,032 colored digit images in the training and test sets, with an additional 531,131 training images available. Following the widely used evaluation protocol in the literature [5, 11, 12, 16], all the training examples are used without data augmentation, while a separate validation set is split from the training set. The model with the smallest validation error is selected and its test error rate is reported.

ImageNet
The ImageNet dataset consists of 1.2 million training and 50k validation images. We apply mean image subtraction as the only pre-processing step and use random cropping, scaling and horizontal flipping for data augmentation [6]. The final resolution of both the training and validation sets is 224 × 224, and a set of images randomly held out from the training set is used for tuning hyperparameters.

We test various networks such as ResNet [6], ResNet (pre-activation) [7], WideResNet [18] and DenseNet [10] as the backbones in experiments. The last output layer of a backbone network is replaced by the capsule projection, where the feature vector from the second-to-last layer of the backbone is projected onto multiple capsule subspaces. The CapProNet is trained from scratch in an end-to-end fashion on the given training set. For the sake of fair comparison, the strategies used to train the respective backbones [6, 7, 18], such as the learning rate schedule, parameter initialization, and the stochastic optimization solver, are adopted to train the CapProNet. We denote the CapProNet with a backbone X by CapProNet+X below.
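As a concrete illustration of this wiring, the hedged sketch below replaces the final fully connected layer of torchvision's ResNet-50 with a compact capsule-projection head (a condensed version of the layer sketched in Section 3). The backbone choice and all names here are illustrative, not the exact training setup used in the paper.

```python
# Sketch of attaching the capsule projection to a standard backbone; torchvision's
# ResNet-50 stands in for the backbones used in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CapsuleProjectionHead(nn.Module):
    def __init__(self, in_dim, num_classes, capsule_dim=8, eps=1e-4):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(num_classes, in_dim, capsule_dim))
        self.eps = eps

    def forward(self, x):                                  # x: (batch, in_dim)
        WtW = torch.einsum('ldc,lde->lce', self.W, self.W)
        sigma = torch.linalg.inv(WtW + self.eps * torch.eye(WtW.size(-1), device=x.device))
        Wtx = torch.einsum('ldc,bd->blc', self.W, x)
        sq_len = torch.einsum('blc,lce,ble->bl', Wtx, sigma, Wtx)
        return torch.sqrt(sq_len.clamp_min(1e-12))         # capsule lengths as logits

model = resnet50(num_classes=1000)
# Replace only the final fully connected layer with the capsule projection head; the
# rest of the backbone and the usual training recipe stay unchanged.
model.fc = CapsuleProjectionHead(in_dim=model.fc.in_features, num_classes=1000, capsule_dim=8)

logits = model(torch.randn(2, 3, 224, 224))                # capsule lengths, shape (2, 1000)
```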
We perform experiments with various backbone networks to compare them with the proposed CapProNet. In particular, we consider three variants of ResNets – the classic one reported in [11] with 110 layers, the ResNet with pre-activation [7] with 164 layers, and two paradigms of WideResNets [18] with 16 and 28 layers – as well as Densenet-BC [10] with 100 layers. Compared with ResNet and ResNet with pre-activation, WideResNet has fewer but wider layers, which reach smaller error rates as shown in Table 1. We test CapProNet+X with these different backbone networks to evaluate whether it can consistently improve these state-of-the-art backbones. It is clear from Table 1 that CapProNet+X outperforms the corresponding backbone networks by a remarkable margin: both CapProNet+ResNet and CapProNet+Densenet reduce the error rates of their backbones on CIFAR10, CIFAR100 and SVHN. Finally, we note that the CapProNet significantly advances the capsule net performance [15] by reducing its test errors on CIFAR10 and SVHN with the chosen backbones.

We also evaluate the CapProNet with ResNet50 and ResNet101 backbones for single-crop top-1/top-5 results on the ImageNet validation set. To ensure a fair comparison, we retrain the backbone networks based on the official ResNet model (https://github.com/tensorflow/models/tree/master/official/resnet), where both the original ResNet [6] and the CapProNet are trained with the same training strategies on four GPUs. The results are reported in Table 2.

Table 1: Error rates (%) of different backbone networks and their CapProNet counterparts on CIFAR10, CIFAR100 and SVHN.

Method | Depth | # Params | CIFAR10 | CIFAR100 | SVHN
CapProNet+ResNet (c=4) | 110 | 1.7M | 5.27 | 22.45 | 1.82
CapProNet+ResNet (c=8) | 110 | 1.7M | | |
ResNet (pre-activation) [7] | 164 | 1.7M | 5.46 | 24.33 | -
CapProNet+ResNet (pre-activation, c=4) | 164 | 1.7M | 4.88 | 21.37 | -
CapProNet+ResNet (pre-activation, c=8) | 164 | 1.7M | 4.89 | | -
WideResNet [18] | 16 | 11.0M | 4.81 | 22.07 | -
WideResNet [18] | 28 | 36.5M | 4.17 | 20.50 | -
WideResNet [18] with Dropout | 16 | 2.7M | - | - | 1.64
CapProNet+WideResNet (c=4) | 16 | 11.0M | 4.20 | 21.33 | -
CapProNet+WideResNet (c=4) | 28 | 36.5M | | |
CapProNet+WideResNet (c=8) | 28 | 36.5M | 3.85 | | -
CapProNet+WideResNet with Dropout | 16 | 2.7M | - | - |
Densenet-BC (k=12) [10] | 100 | 0.8M | 4.51 | 22.27 | 1.76
CapProNet+Densenet-BC k=12 (c=4) | 100 | 0.8M | 4.35 | 21.22 | 1.64
CapProNet+Densenet-BC k=12 (c=8) | 100 | 0.8M | | |
Table 2: The CapProNet results with ResNet50 and ResNet101 backbones for single-crop top-1/top-5 error rates (%) on the ImageNet validation set with an image resolution of 224 × 224, compared with the original baseline results.

Method | reported result [6] | our rerun | CapProNet (c=2) | CapProNet (c=4) | CapProNet (c=8)
ResNet50 | 24.8 / 7.8 | 24.09 / 7.13 | 23.282 / 6.8 | 23.265 / | / 6.78
ResNet101 | 23.6 / 7.1 | 22.81 / 6.67 | 22.192 / 6.178 | |

The CapProNet+X successfully outperforms the original backbones on both top-1 and top-5 error rates. It is worth noting that these gains are obtained by replacing only the last layer of the backbones with the capsule projection layer. We believe the error rate can be further reduced by replacing intermediate convolutional layers with capsule projections, and we leave this to future research.

We also note that CapProNet+X consistently outperforms the backbone counterparts with varying dimensions c of the capsule subspaces. In particular, with the WideResNet backbones, the error rates are in most cases reduced with an increasing capsule dimension c on all datasets, where the smallest error rates often occur at c = 8. In contrast, while CapProNet+X still clearly outperforms both the ResNet and ResNet (pre-activation) backbones, its error rates are roughly at the same level across capsule dimensions. This is probably because both ResNet backbones have a much smaller input dimension d = 64 of feature vectors into the capsule projection than the WideResNet backbones, where d = 128 and d = 160 with 16 and 28 layers, respectively. This suggests that a larger input dimension enables capsule subspaces of higher dimensions to encode patterns of variation along more directions in a higher-dimensional input feature space.

To further assess the effect of the capsule projection, we compare with a method that simply groups the output neurons into capsules without performing an orthogonal projection onto capsule subspaces. We still use the lengths of these resultant "capsules" of grouped neurons to classify input images, and the model is trained in an end-to-end fashion accordingly. Unfortunately, this approach, denoted GroupNeuron+ResNet in Table 3, does not show a significant improvement over the backbone network. For example, the smallest error rate reached by GroupNeuron+ResNet is 6.26 at c = 2, only a small improvement over the error rate reached by ResNet110. This demonstrates that the capsule projection makes an indispensable contribution to improving model performance.

Figure 2: These figures plot the 2-D capsule subspaces and projected capsules corresponding to the ten classes of the CIFAR10 dataset. In each figure, red capsules represent samples from the class corresponding to the subspace, while green capsules belong to a different class. Red samples have larger capsule lengths (relative to the origin) than green samples, which validates the capsule length as the classification criterion in the proposed model. Note that some figures have different scales in the two axes for better illustration.

When training on CIFAR10/100 and SVHN, one training iteration of ResNet-110 incurs only a small additional cost for the corresponding CapProNet, a negligible computing overhead. The memory overhead for the model parameters is even smaller: for example, the CapProNet+ResNet only has a small number of additional parameters at c = 2 compared with the 1.7M parameters of the backbone ResNet. We do not notice any large computing or memory overheads with the ResNet (pre-activation) or WideResNet backbones, either.
This shows the advantage of CapProNet+X, as its error rate reduction is not achieved by consuming much more computing and memory resources.
Table 3: Comparison between GroupNeuron and CapProNet with the ResNet110 backbone on the CIFAR10 dataset, with error rates (%) for c = 2, 4, 8 capsules. It shows the need for the capsule projection to obtain better results.

c | GroupNeuron | CapProNet
2 | 6.26 |

To give an intuitive insight into the learned capsule subspaces, we plot the projection of input feature vectors onto the capsule subspaces. Instead of directly using $\mathbf{P}_l \mathbf{x}$ to project feature vectors onto capsule subspaces in the original input space $\mathbb{R}^d$, we use $(\mathbf{W}_l^T \mathbf{W}_l)^{-1/2} \mathbf{W}_l^T \mathbf{x}$ to project an input feature vector $\mathbf{x}$ onto $\mathbb{R}^c$, since this projection preserves the capsule length $\|\mathbf{v}_l\|$ defined in (2).

Figure 2 illustrates the 2-D capsule subspaces learned on CIFAR10 when c = 2 and d = 64 in CapProNet+ResNet110, where each subspace corresponds to one of the ten classes. Red points represent the capsules projected from the class of input samples corresponding to the subspace, while green points correspond to one of the other classes. The figure shows that red capsules have larger lengths than green ones, which suggests the capsule length is a valid metric for classifying samples into their corresponding classes. Meanwhile, the orientation of a capsule reflects various instantiations of a sample in these subspaces. These figures visualize the separation of the lengths of capsules from their orientations in classification tasks.
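A small sketch of how such a visualization can be produced: features are mapped to $\mathbb{R}^c$ with the length-preserving coordinates $(\mathbf{W}_l^T \mathbf{W}_l)^{-1/2} \mathbf{W}_l^T \mathbf{x}$ and scatter-plotted for $c = 2$. The random features below are placeholders for real backbone outputs.

```python
# Sketch of the 2-D capsule-subspace visualization: features are mapped to R^c via
# (W_l^T W_l)^{-1/2} W_l^T x, whose Euclidean norm equals the capsule length of Eq. (2).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d, c = 64, 2
W_l = rng.standard_normal((d, c))                 # learned basis for one class subspace

# Inverse square root of W^T W via its eigendecomposition.
evals, evecs = np.linalg.eigh(W_l.T @ W_l)
inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

X = rng.standard_normal((200, d))                 # placeholder feature vectors
coords = X @ W_l @ inv_sqrt                       # (200, 2) coordinates in the subspace
lengths = np.linalg.norm(coords, axis=1)          # capsule lengths ||v_l|| per sample

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.xlabel('capsule dimension 1'); plt.ylabel('capsule dimension 2')
plt.title('Projected capsules for one class subspace (illustrative data)')
plt.show()
```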
6 Conclusions and Future Work

In this paper, we present a novel capsule projection network (CapProNet) that learns a group of capsule subspaces for different classes. Specifically, the parameters of an orthogonal projection are learned for each class, and the lengths of the projected capsules are used to predict the entity class for a given input feature vector. The training continues until the capsule subspaces contain the input feature vectors of their corresponding classes or the back-propagated error vanishes. Experiment results on real image datasets show that the proposed CapProNet+X can greatly improve the performance of backbone networks without incurring large computing and memory overheads. While we only test the capsule projection as the output layer in this paper, we will attempt to insert it into intermediate layers of backbone networks as well, and hope this can give rise to a new generation of capsule networks with more discriminative architectures in the future.
Acknowledgements
L. Zhang and M. Edraki made equal contributions to implementing the idea: L. Zhang conducted experiments on the CIFAR10 and SVHN datasets and visualized projections in capsule subspaces on CIFAR10; M. Edraki performed experiments on CIFAR100. G.-J. Qi initialized and formulated the idea, and prepared the paper.
References

[1] Mohammad Taha Bahadori. Spectral capsule networks. 2018.
[2] Adi Ben-Israel. An iterative method for computing the generalized inverse of an arbitrary matrix. Mathematics of Computation, 19(91):452–455, 1965.
[3] Adi Ben-Israel and Dan Cohen. On iterative computation of generalized inverses and associated projections. SIAM Journal on Numerical Analysis, 3(3):410–419, 1966.
[4] Adi Ben-Israel and Thomas N. E. Greville. Generalized Inverses: Theory and Applications, volume 15. Springer Science & Business Media, 2003.
[5] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[8] Geoffrey Hinton, Nicholas Frosst, and Sara Sabour. Matrix capsules with EM routing. 2018.
[9] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[11] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[12] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[13] Kaare Brandt Petersen et al. The Matrix Cookbook.
[14] David Rawlinson, Abdelrahman Ahmed, and Gideon Kowadlo. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094, 2018.
[15] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859–3869, 2017.
[16] Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288–3291. IEEE, 2012.
[17] Dilin Wang and Qiang Liu. An optimization view on dynamic routing between capsules. 2018.
[18] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.