Subspace Capsule Network

Marzieh Edraki,* Nazanin Rahnavard, Mubarak Shah
Center for Research in Computer Vision, Department of Electrical and Computer Engineering
University of Central Florida, Orlando, Florida, USA
[email protected], [email protected], [email protected]
* Corresponding author
Abstract
Convolutional neural networks (CNNs) have become a key asset in most fields of AI. Despite their successful performance, CNNs suffer from a major drawback: they fail to capture the hierarchy of spatial relations among the different parts of an entity. As a remedy to this problem, the idea of capsules was proposed by Hinton. In this paper, we propose the SubSpace Capsule Network (SCN), which exploits the idea of capsule networks to model possible variations in the appearance or implicitly defined properties of an entity through a group of capsule subspaces, instead of simply grouping neurons to create capsules. A capsule is created by projecting an input feature vector from a lower layer onto the capsule subspace using a learnable transformation. This transformation finds the degree of alignment of the input with the properties modeled by the capsule subspace.
We show that SCN is a general capsule network that can successfully be applied to both discriminative and generative models without incurring computational overhead compared to CNNs at test time. The effectiveness of SCN is evaluated through a comprehensive set of experiments on supervised image classification, semi-supervised image classification, and high-resolution image generation using the generative adversarial network (GAN) framework. SCN significantly improves the performance of the baseline models in all three tasks.
Introduction
In recent years, convolutional neural networks (CNNs) have become a key asset in most fields of AI. Various tasks in computer vision, reinforcement learning, natural language and speech processing have achieved significant improvement by using them. New applications like music generation (Dong et al. 2018), visual text correction (Mazaheri and Shah 2018), and online fashion recommendation (Han et al. 2017) are founded on the feature-learning capability of CNN architectures. Despite their successful performance, CNNs suffer from a major drawback: they fail to capture the hierarchy of spatial relations among the different parts of an entity. As a remedy to this problem, Hinton et al. introduced the idea of Capsule Networks (Hinton, Krizhevsky, and Wang 2011). Capsule networks received a flurry of attention after achieving state-of-the-art performance on image classification (Sabour, Frosst, and Hinton 2017), text classification (Zhao et al. 2018), action detection and localization (Duarte, Rawat, and Shah 2018), image segmentation (LaLonde and Bagci 2018), etc. Moreover, many efforts have been made to improve the structure of capsule networks (Hinton, Sabour, and Frosst 2018; Bahadori 2018; Zhang, Edraki, and Qi 2018) as a new generation of deep neural networks.
A capsule is defined as a group of neurons that can ultimately model different properties, such as pose, texture or deformation, of an entity or a part of an entity. Each layer of a capsule network consists of many capsules. In a well-trained capsule network, the activation vector of each capsule represents the instantiation parameters of an entity, and the length of the capsule scores the presence of that feature or part of the entity. In this paper, while we still follow the main definition of capsules, we propose
Subspace Capsule Networks (SCNs), which build capsules based on the degree of relevance of an input feature vector to a group of learned subspaces. In SCNs, a specific capsule subspace is learned for each entity or part of an entity. A capsule is then created by projecting the input feature vector onto the capsule subspace using a learned transformation defined by the basis of the corresponding capsule subspace. Intuitively speaking, a capsule subspace captures the variation in visual properties (such as appearance, pose, texture and deformation) of an object or of an implicitly defined feature of that object. The length of the output vector of a subspace capsule represents the degree of alignment of an input with the properties modeled by that subspace. Hence, if a subspace capsule has a large activity vector, the input feature vector is highly related to the entity modeled by that subspace, and vice versa. This way of creating subspace capsules makes SCN independent of any form of routing required by the capsule networks introduced in (Sabour, Frosst, and Hinton 2017) or (Hinton, Sabour, and Frosst 2018). Due to this property, SCN is easily scalable to large network architectures and large datasets.
Closest to our work is the CapProNet model, proposed by (Zhang, Edraki, and Qi 2018), in which the authors apply subspace-based capsules merely in the last layer of an image classification network and use only the capsule length for prediction. In a classification task with N classes, a group of capsule subspaces {S_1, ..., S_N} is learned. The capsule corresponding to each class is then created by orthogonal projection of the input feature vector from a backbone model onto the learned subspace, and the input image is assigned to the class with the largest capsule length. In SCN, unlike CapProNet, we are interested in both the subspace capsules themselves and the norms of the capsules.
The summary of our contributions is as follows:
• The proposed
SCN is a general capsule model that can be used, without any change in its formulation, in both generative and discriminative models.
• SCN is computationally efficient, with no computational overhead during the test phase and a negligible computational overhead during training (with the help of the method introduced in Section 5), compared to the baselines.
• When applied in the generator model of a GAN, SCN consistently improves the relative FID score of generated samples in all of our experiments.
• SCN achieves state-of-the-art performance in semi-supervised classification of the CIFAR10 and SVHN datasets and improves the relative error rate of the baseline models on both.
• SCN is easily scalable to large architectures and datasets like ImageNet. When applied to the last block of the ResNet model, it decreases the relative Top-1 error rate.
The rest of the paper is organized as follows. We first briefly review studies related to capsule networks and GAN models in Section 2. The Subspace Capsule Network is formally presented in Section 3, followed by a study of the effects of SCN on the GAN framework in Section 4. Implementation details are discussed in Section 5. We evaluate the performance of SCN in Section 6, and the conclusion is presented in Section 7.
Related Work
The idea of capsule networks was first introduced in Transforming Auto-encoders, where Hinton et al. pointed out that CNNs cannot achieve viewpoint invariance just by looking at the activity of a single neuron, and that a more complex structure like a capsule is necessary. The output of a capsule is a vector that summarizes information about an entity or part of that entity. The main advantage of capsule networks is that the part-whole relation can be captured through the capsules of consecutive layers. (Sabour, Frosst, and Hinton 2017) define a capsule as a group of neurons whose output vector's orientation represents the instantiation parameters of the visual entity modeled by that capsule, and whose length represents the probability of the entity's existence. They use dynamic routing between capsules to capture the part-whole relationship. Dynamic routing works by measuring the agreement of two capsules in consecutive layers using the scalar product of their capsule vectors.

Figure 1: MNIST samples generated by varying each dimension of the first-layer capsules over a fixed range of values.

The subsequent paper (Hinton, Sabour, and Frosst 2018) extends the idea of capsules by separating each capsule into a 4 × 4 pose matrix and an activation probability. Dynamic routing is updated to the EM-routing algorithm, a more efficient way of measuring the agreement between capsules. The new capsule structure leads to state-of-the-art performance on the SmallNORB classification task. (Singh et al. 2019) use the capsule idea in low-resolution image recognition. The idea of 3D capsules was introduced in (Duarte, Rawat, and Shah 2018) to tackle the action detection and localization problem. CapProNet (Zhang, Edraki, and Qi 2018) proposes learning a group of capsule subspaces in the final layer of a CNN for the image classification task. Capsule networks were also applied to medical image segmentation by (LaLonde and Bagci 2018), achieving competitive results with a convolution-deconvolution capsule network structure.
The common point among all of these studies is that they try to solve a discriminative task, such as classification, image segmentation, or action detection, using a capsule network. There have been only a few attempts at using the capsule architecture in a generative model. CapsuleGAN (Jaiswal et al. 2018) applies a capsule network in the discriminator of a GAN to improve the quality of generated samples, and CapsPix2Pix (Bass et al. 2019) uses convolutional capsules to synthesise images conditioned on segmentation labels in order to pre-train segmentation models for the medical image segmentation task. Since the introduction of GANs by (Goodfellow et al. 2014), many efforts have been made to improve the stability of training and the quality of generated samples. Among them, the Wasserstein loss with gradient penalty (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017) and spectral normalization (Miyato et al. 2018a) successfully stabilize the training process by enforcing Lipschitz continuity on the discriminator, while ProgressiveGAN (Karras et al. 2017) and BigGAN (Brock, Donahue, and Simonyan 2018) generate high-quality samples by improving the architecture.
In this paper, we exploit the inherent property of subspace capsules, namely modeling the variation in appearance of a visual entity, in a GAN model to produce diverse and high-quality image samples. We also show the superiority of our proposed model in semi-supervised image classification using the GAN framework, as well as in supervised image classification.

Subspace Capsule Networks
In this section, we formalize the idea of subspace capsule networks (SCNs) by presenting their main components. In each layer, an SCN learns a group of capsule subspaces, each of which captures possible variations of an implicitly defined visual entity. An input from the lower layer is projected onto each of these capsule subspaces to create new capsules. If the input and a capsule subspace are related, for instance, if the input is a part of the entity represented by a capsule subspace, the output vector (the projection of the input vector onto the corresponding subspace) will be large. Moreover, the orientation of a capsule vector represents the properties of that entity.
Since the key component of an SCN is finding the level of alignment between input feature vectors and capsule subspaces, we first elaborate on the proposed projection matrix and formulate subspace capsules. We then present the capsule activation functions, followed by the subspace capsule convolution layer and the idea of subspace capsule mean pooling.
Projection onto a Capsule Subspace
For layer k, suppose x ∈ R^d is an input feature vector coming from the lower layer k−1. Suppose a capsule subspace S of dimension c is formed as the span of the columns of a weight matrix W ∈ R^{d×c}, where c ≪ d.
The most straightforward way to find the degree of alignment between the feature vector x and the capsule subspace S is to orthogonally project x onto S. This problem has the closed-form solution

y = W (W^T W)^{-1} W^T x = P x,   (1)

where P = W (W^T W)^{-1} W^T ∈ R^{d×d} is the matrix of orthogonal projection onto S, and y ∈ R^d is the projection of x onto S. The larger the length of y, the more correlated x and the capsule subspace S are; in other words, x has more of the properties modeled by S.
However, the projection matrix P ∈ R^{d×d} has the major drawback of being a square matrix. This means that if we create a capsule by projecting the feature vector x ∈ R^d onto the capsule subspace S using P, that capsule still lives in the d-dimensional space. Practically speaking, if d is large, which is usually the case in deep models, having many different capsule types via the orthogonal projection matrix P would be infeasible, since it demands a lot of memory. To benefit from various capsule sizes through a sequence of subspace capsule layers, one needs a transformation that maps the input feature vector x into the c-dimensional space of the capsule subspace while still preserving the relations among the capsules in consecutive layers of the network. We propose to employ an intermediate domain given by a transformation matrix P_c ∈ R^{c×d}. This matrix is derived by decomposing the orthogonal projection matrix P as

P = P_d P_c,   (2)

where

P_d = W (W^T W)^{-1/2},   (3a)
P_c = (W^T W)^{-1/2} W^T.   (3b)

Here, P_c is the transformation that maps the input feature vector x into the c-dimensional capsule space (this c-dimensional space is defined by the span of the right singular vectors of W), and P_d is the transformation that projects vectors in the capsule space back to the original d-dimensional space of the input feature vector x. Now the capsule that corresponds to the capsule subspace S can be created by projecting the feature vector x onto the c-dimensional capsule space as

u = P_c x.   (4)

Here, u is the low-dimensional representation of x in the capsule space. The matrix P is symmetric and positive semi-definite, so its decomposition as suggested in Equation (2) has special properties. We claim that a capsule created using P_c carries the same information about the instantiation parameters, and the same score for the presence of features, as it would if created by the transformation P. The proof of this claim follows from Theorem 1.

Theorem 1. Let P, as defined in (1), denote the orthogonal projection matrix onto the subspace spanned by the columns of the weight matrix W ∈ R^{d×c}. Assume P is decomposed into two matrices P_d and P_c as in (2). Then the transformation matrix P_d is an isomorphic transformation between R^c and R^d, i.e., ∀ u ∈ R^c, ‖u‖_2 = ‖P_d u‖_2. (Indeed, P_d^T P_d = (W^T W)^{-1/2} W^T W (W^T W)^{-1/2} = I_c, so P_d preserves inner products and therefore norms.)

The following can be concluded from Theorem 1.
• The norm of the capsule vector u defined in Equation (4) represents the score of the features modeled by S in the input feature vector x, since ‖u‖_2 = ‖y‖_2, where ‖·‖_2 denotes the l_2-norm of a vector.
• For two input feature vectors x_1 and x_2, the relation between their corresponding capsules u_1 and u_2 is the same as the relation between y_1 and y_2. For instance, the angle between u_1 and u_2 is the same as the angle between y_1 and y_2.

Activation Function
We apply two types of activation functions to subspace capsules, based on two interpretations of the length of a capsule's output vector. The length of the output vector of a capsule can be interpreted from the confidence perspective: high confidence for a capsule indicates that the input feature vector is highly aligned with the capsule subspace. In other words, the input feature vector contains the entity that is modeled by the capsule subspace. We also want to suppress the effect of noisy capsules in one layer on activating the capsules of the next layer. Following this perspective, we propose the "sparking" function given by

v = max(‖u‖ − b, 0) · u / ‖u‖,   (5)

where b is a learnable parameter.

Figure 2: (a) In the generator, the latent representation z is projected onto 10 capsule subspaces of dimension c = 16 in the first layer. The capsule with the largest vector is selected, reshaped into a small tensor, and up-sampled to double the spatial resolution. This tensor goes through 2 layers of subspace capsule convolution (SC-conv) with 8 capsule types of dimension 16 and 8, respectively, each followed by an upsampling operation. The final SC-conv layer has 8 subspace capsule types, each of dimension 8. The output of this layer is fed to a transposed convolution layer to generate the final image. (b) The SCN architecture of the discriminator component of the GAN for the SVHN dataset. Features are extracted using 6 convolutional layers, followed by 3 subspace capsule convolution (SC-conv) layers, each with 64 subspace capsule types, one subspace capsule mean pooling (SC-mean pool) layer, and a final subspace capsule fully connected (SC-Fc) layer with 10 capsule types.
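As a concrete sketch of Equations (1)-(5), the following NumPy snippet builds P_c, checks the norm-preservation claim of Theorem 1, and applies the sparking activation. The eigendecomposition route to (W^T W)^{-1/2} and all shapes are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def capsule_projector(W):
    """P_c = (W^T W)^{-1/2} W^T (Equation 3b), via eigendecomposition."""
    vals, vecs = np.linalg.eigh(W.T @ W)       # W^T W is SPD for full-rank W
    inv_sqrt = (vecs * vals ** -0.5) @ vecs.T  # (W^T W)^{-1/2}
    return inv_sqrt @ W.T                      # shape (c, d)

def sparking(u, b=0.1):
    """Equation (5): shrink the capsule norm by b, or switch the capsule off."""
    n = np.linalg.norm(u)
    return max(n - b, 0.0) * u / (n + 1e-12)

rng = np.random.default_rng(0)
d, c = 64, 4                                   # illustrative sizes, c << d
W = rng.standard_normal((d, c))
x = rng.standard_normal(d)

Pc = capsule_projector(W)
u = Pc @ x                                     # capsule (Equation 4)
P = W @ np.linalg.inv(W.T @ W) @ W.T           # orthogonal projector (Equation 1)
y = P @ x

# Theorem 1: the c-dimensional capsule keeps the norm of the full projection.
assert np.isclose(np.linalg.norm(u), np.linalg.norm(y))
v = sparking(u)
```

Note that only the small (c × d) matrix P_c is ever needed at run time; the full (d × d) projector P is computed here purely to verify the norm equality.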
Intuitively, the proposed activation function increases the capsule's certainty if x is related to the entity modeled by the capsule subspace S, and completely turns the capsule off if its length falls below the threshold b. We initialize b to a small positive value in our experiments and update it, along with the network parameters, through the training process using stochastic gradient descent.
Another possibility is to relate the length of the output capsule to the probability of the presence of the entity modeled by the capsule subspace. For that, we follow (Sabour, Frosst, and Hinton 2017) and use the squashing function defined as

v = (‖u‖^2 / (1 + ‖u‖^2)) · u / ‖u‖.   (6)

We found the sparking function to be more effective in discriminative tasks, i.e., in our (semi-)supervised classification of images, since it outputs sparse feature maps by turning off noisy capsules, which leads to faster convergence. Noisy capsules in a layer are those that represent properties unrelated to the input image; they have small activity vectors. In generative models, on the other hand, keeping small but non-zero values by applying the squashing activation function to the capsules leads to higher-quality generated samples.

Subspace Capsule Convolution
SCN can also benefit from the weight-sharing idea of CNNs by using the same subspace capsule types at all spatial locations of an image.
In subspace capsule convolution, if the input x has i feature maps and we want to create a c-dimensional subspace capsule convolution kernel with receptive field k, we build the transformation matrix P_c as defined in Equation (3b) from a weight matrix W ∈ R^{(i×k×k)×c}. We can treat each row of the projection matrix P_c as one convolution kernel of size (i × k × k) that convolves over the input feature maps and generates a single element of the output capsule. So if P_c is reorganized into a 4-dimensional tensor with shape (c × i × k × k), it can be used as the kernel of a regular convolution operation, and the capsule corresponding to each spatial location is placed along the output feature maps. If we want n subspace capsule types, we create a group of projection matrices {P_c^1, ..., P_c^n}, reorganize each of them into a 4-dimensional tensor, and concatenate them to create a kernel of shape (nc × i × k × k). From now on, we represent the kernel of a subspace capsule convolution layer by a tuple (n, c, k, k).

Subspace Capsule Mean Pooling
The idea of mean pooling comes naturally after subspace capsule convolution. In subspace capsule convolution, capsules of the same type represent the same visual property regardless of spatial position. It is therefore a safe assumption that capsules of the same type within a small receptive field of k × k have similar orientations, and that a single capsule, the mean of those capsule vectors, can represent all of them.

Gradient Analysis
Subspace capsule networks are trained with stochastic gradient descent methods, so analyzing the gradient used to update W in each step clarifies how an SCN learns its capsule subspaces.
Assume we have a loss function L and we want to differentiate it with respect to the weight matrix W, the basis of subspace S, through the projection onto the subspace (Equation 3b). For simplicity, first assume a 1-dimensional capsule subspace, i.e., c = 1. Using the chain rule, the gradient is given by

∇_W L = (1 / ‖W‖) (I − P) ∇_{P_c} L,   (7)

where ∇_{P_c} L is the gradient with respect to the projection matrix P_c, computed in the same way as the gradient with respect to the kernel of a convolution operation. The term (I − P) is the projection matrix onto the orthogonal complement of the subspace S. This shows that the basis of the capsule subspace S spanned by the columns of W is only updated along the orthogonal complement of S, up to the scale 1/‖W‖. The orthogonal complement of S contains those novel features of x that are not yet captured by S.
This nice property of the gradient extends to higher-dimensional subspaces. Using the chain rule and the derivative of the inverse of a matrix (Petersen et al.), the gradient is

∇_{W_ij} L = (W^T W)^{-1/2} s_ij^T (I − P) ∇_{P_c} L,   (8)

where s_ij is a single-non-zero-entry matrix corresponding to the gradient of W with respect to its element at position (i, j). The general case supports the same conclusion as the special case, since (W^T W)^{-1/2} only stretches the space along the basis of the subspace, by scale factors given by its eigenvalues.

So far, we have defined all the building blocks of a subspace capsule network. Next, we discuss how SCN can be effective in enhancing the performance of GANs. When GAN models are used in semi-supervised learning tasks such as image classification, the discriminator can benefit from the ability of SCN to model the possible variations of visual properties (for instance texture, pose, and color) corresponding to an entity using a group of capsule subspaces through a sequence of subspace capsule layers. By creating capsules through the projection of input feature vectors onto these capsule subspaces, and treating the lengths of the capsules as confidence in the presence of the properties modeled by the subspaces, the discriminator can be made invariant to the possible deformations of each visual property.
GAN models can also leverage subspace capsule layers in the generator network. A subspace capsule generator consists of multiple subspace capsule layers, and each layer has multiple subspace capsule types. When trained, each subspace capsule type models all the possible variations of a visual entity. The goal of the generator in each layer is then to find the related properties and features that need to be added to the image generated so far. In addition, using SCN as the generator leads to more diverse generated samples, since in each layer properties are sampled from subspaces that ensure the disentanglement of variation along their bases. In other words, each dimension of a subspace capsule has a unique effect on the generated samples. Figure 1 showcases this property of SCN: each row represents one feature, such as rotation, stroke thickness, or scale of the generated digits, and the samples are generated by tweaking one dimension of the capsules of the first layer of the generator over a fixed range. The generated samples in each row are diverse, and we can move over the appearance manifold of each digit by changing the value of a capsule dimension. Figure 2(a) shows the architecture of the SCN generator, with the detailed training process explained in Section 6.
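The subspace capsule convolution kernel described earlier, together with a Newton-Schulz-style coupled iteration for (W^T W)^{-1/2} of the kind detailed in the implementation section, can be sketched as follows. Shapes, sizes, and helper names are illustrative choices of ours, not the authors' code.

```python
import numpy as np

def invsqrt_iter(A, iters=20):
    """Coupled Newton iteration: Y_k -> A^{1/2}, Z_k -> A^{-1/2}, using only
    matrix products; A is pre-scaled by its norm so the iteration converges."""
    m = A.shape[0]
    norm = np.linalg.norm(A)
    Y, Z = A / norm, np.eye(m)
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(m) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Z / np.sqrt(norm)                 # undo the scaling on A^{-1/2}

def sc_conv_kernel(Ws, i, k, c):
    """Stack one P_c per capsule type into a regular convolution kernel.
    Ws: list of n weight matrices, each of shape (i*k*k, c)."""
    kernels = []
    for W in Ws:
        Pc = invsqrt_iter(W.T @ W) @ W.T     # (c, i*k*k), Equation (3b)
        kernels.append(Pc.reshape(c, i, k, k))
    return np.concatenate(kernels, axis=0)   # (n*c, i, k, k)

rng = np.random.default_rng(1)
i, k, c, n = 3, 3, 2, 4                      # illustrative sizes
Ws = [rng.standard_normal((i * k * k, c)) for _ in range(n)]
kernel = sc_conv_kernel(Ws, i, k, c)
print(kernel.shape)                          # (8, 3, 3, 3), i.e. (n*c, i, k, k)
```

The resulting tensor can be passed directly to a standard convolution routine, which is why subspace capsule convolution adds no structural overhead at test time.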
Implementation Details
The projection matrix P_c, as defined in Equation (3b), involves taking the inverse and the square root of the matrix W^T W, two computationally expensive operations. If not implemented properly, they can hinder the training process. In this work, we use a stable extension of the Denman-Beavers iterative method (Denman and Beavers Jr 1976). It is known that any symmetric positive (semi-)definite matrix A has a unique symmetric positive (semi-)definite square root, and Higham (1997) proposed an iterative process that converges to the square root of such matrices:

Initialize Y_0 = A and Z_0 = I. For k = 0, 1, 2, ...

Y_{k+1} = (1/2) Y_k (3I − Z_k Y_k),
Z_{k+1} = (1/2) (3I − Z_k Y_k) Z_k,   (9)

where k is the iteration number. It has been shown that Y_k and Z_k converge to A^{1/2} and A^{-1/2}, respectively. This process requires only matrix multiplications, which suit parallel computation on GPUs best, and it computes the inverse square root of W^T W at the same time. In all of our experiments we set the number of iterations to k = 20. This iterative process increases the training time negligibly compared to the total training time: for instance, in our training of SCN for large-resolution images, where all layers of the generator are replaced by SCN convolution layers, the training time increases to 0.0529 sec/img, compared to 0.047 sec/img for the baseline.
It is worth noting that once the training process completes, the capsule projection matrix P_c is fixed, so there is no time overhead from this iterative process at test time.

Experiments
In this section, we demonstrate the superiority of
SCNs on three tasks: supervised classification of image data, semi-supervised classification of image data, and high-quality image generation on multiple datasets.
Datasets: We use CIFAR10 (Krizhevsky and Hinton 2009), Street View House Numbers (SVHN) (Netzer et al. 2011), ImageNet (Deng et al. 2009), CelebA (Liu et al. 2015), and 3 categories of the LSUN dataset, namely bedroom, cat and horse, throughout our experiments.
Code: http://github.com/MarziEd/SubSpace-Capsule-Network

Table 1: Classification errors on CIFAR-10 and SVHN compared with the state-of-the-art methods: Improved GAN (Salimans et al. 2016), ALI (Dumoulin et al. 2016), LSAL (Edraki and Qi 2018), VAT (Miyato et al. 2018b), and SCN. The error rates with N_l = 4000 (CIFAR-10) and N_l = 1000 (SVHN) labeled training examples are reported.

SCNs for Classification
Semi-supervised classification: For semi-supervised classification, we evaluate the performance of the SCN model on two benchmark datasets, CIFAR10 and SVHN, through the GAN framework. To have a fair comparison with the state-of-the-art methods, we use the same network architecture and loss functions for the generator and discriminator as the model proposed by (Salimans et al. 2016).
SVHN: In semi-supervised classification of the SVHN dataset, we replace the last 4 layers of the discriminator with subspace capsule layers. Figure 2(b) shows the architecture of the SCN discriminator. An input image passes through 6 convolutional layers that produce 128 feature maps. These feature maps go through three subspace capsule convolution layers, each with 64 different capsule types of 2-dimensional subspaces; the layers differ only in their kernel sizes. We apply the sparking function in all three layers. We feed the capsules of the last subspace capsule convolution layer to a subspace capsule mean pooling layer, resulting in 64 capsule types of size 2, followed by a final subspace capsule fully connected layer with 10 4-dimensional subspace capsule types. The input image is assigned to the class whose output capsule has the largest norm.
CIFAR10: For the CIFAR10 dataset, the architecture of the discriminator is similar to that of SVHN, except that the subspace capsule convolution layers have 96 capsule types of size 2. The generator architecture for both datasets is the same as the baseline architecture (Salimans et al. 2016).
We train the network using the Adam optimizer. We hold out a set of 5000 training samples as a validation set for subspace capsule dimension selection, and afterward fine-tune the whole model on all training samples.
Table 1 compares the performance of the SCN model on semi-supervised image classification of CIFAR10 and SVHN with 4000 and 1000 labeled samples, respectively.
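The decision rule described above (assign the input to the class whose capsule has the largest norm) can be sketched as follows. The weight matrices here are random stand-ins for learned subspaces, and the sizes are illustrative.

```python
import numpy as np

def predict(x, subspaces):
    """Classify x as the class whose capsule norm ||P_c x|| is largest."""
    norms = []
    for W in subspaces:
        vals, vecs = np.linalg.eigh(W.T @ W)
        Pc = (vecs * vals ** -0.5) @ vecs.T @ W.T   # Equation (3b)
        norms.append(np.linalg.norm(Pc @ x))
    return int(np.argmax(norms))

rng = np.random.default_rng(2)
d, c, num_classes = 32, 4, 10
subspaces = [rng.standard_normal((d, c)) for _ in range(num_classes)]

# A vector lying inside subspace 3 is scored highest by capsule 3, because the
# capsule norm reaches the full input norm only when x lies in the subspace.
x = subspaces[3] @ np.ones(c)
print(predict(x, subspaces))                        # 3
```

This is why no separate softmax head is required: the capsule norms themselves act as class scores.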
Supervised classification: We evaluate the scalability of SCN on large datasets like ImageNet. We also compare the performance of SCN with the capsule network proposed by Sabour et al. on the CIFAR10 dataset.
ImageNet:
For the ImageNet dataset, the last 4 layers of the ResNet model with depth 34 are replaced with SCN layers; the batch normalization layers in the final block and the final residual connection are removed. Mean pooling is replaced by SCN mean pooling. The model was trained using SGD with a momentum rate of 0.9 for 100 epochs. The learning rate is initialized to 0.1 and decayed every 30 epochs by a factor of 0.1. Table 2 shows that SCN outperforms the baseline model, reducing the relative top-1 classification error of ResNet by about 4.3% (from 27.13% to 25.96%).

Model    Depth   SC-Fc      SC-Conv   Top1    Top5
ResNet   34      -          -         27.13   8.84
SCN      34      (1000,4)   (256,2)   -       -
SCN      34      (1000,4)   (128,4)   25.96   8.35

Table 2: Single-crop Top-1 and Top-5 error rates for ImageNet classification with a ResNet backbone. In the SC-Fc and SC-Conv columns, a tuple (n, c) gives the number of capsule types n and the subspace capsule dimension c.

Table 3: Comparison of FID scores of SCN with the baseline, ProgGAN (Karras et al. 2017), on CelebA, LSUN bedroom, LSUN horse, and LSUN cat. Entries marked with * are our reruns of the baseline.

CIFAR10:
For supervised classification of CIFAR10, we also replace the convolution layers of the last bottleneck block of the ResNet model with 110 layers by SCN convolution layers, each with 32 capsule types of subspace capsule dimension c = 2. The batch normalization layers and the residual connection of this block are removed. Mean pooling is replaced by SCN mean pooling, and the final fully connected layer is replaced by an SCN fully connected layer with 10 capsule types of subspace capsule dimension c = 4. This model achieves an error rate that significantly outperforms the capsule network model of (Sabour, Frosst, and Hinton 2017), and it also improves the relative error rate of the ResNet baseline.

SCNs for Image Generation
We evaluate the effectiveness of subspace capsule networks on the image generation task using the GAN framework, for various image sizes and datasets. In all of our experiments, we build the generator on subspace capsule networks and the discriminator on CNNs.
MNIST: The SCN architecture of the generator is shown in Figure 2(a). The first layer has 10 subspace capsule types, each a 16-dimensional capsule subspace, so its output is 10 subspace capsules. The capsule with the largest output vector is selected and reshaped into a small tensor, which then goes through a bilinear upsampling layer that doubles the spatial size and a subspace capsule convolution layer with 8 capsule types of dimension 16. The third layer has the same upsampling and subspace capsule convolution structure as the second, except that its capsule dimension is 8. This is followed by a last subspace capsule convolution layer with 8 capsule types of dimension 8. The final layer is a transposed convolution with a 5 × 5 receptive field and stride 2, followed by a sigmoid activation function. All subspace capsule convolution layers have stride 1 and the squashing activation function. The discriminator is composed of 4 convolution layers with 5 × 5 receptive fields and stride 2. We apply batch normalization to all convolutional layers, and the activation function is a leaky ReLU. This is followed by global mean pooling and a fully connected layer to 10 output classes.
We follow AC-GAN (Odena, Olah, and Shlens 2017) and add an auxiliary classification loss to ensure that each capsule subspace in the first layer of the generator captures the variation of a single class. To this end, we use the index of the capsule with the maximum length in the first layer as the ground-truth label for the generated sample. We train this model using the Adam optimizer with an initial learning rate of 0.0002 for 25 epochs.

Figure 3: Generated samples for various datasets: (a) CelebA (128 × 128), (b) LSUN bedroom (128 × 128), (c) LSUN horse (256 × 256), (d) LSUN cat (256 × 256).

High-Resolution Images: We also apply
SCN to generating high-resolution images of size 128 × 128 and 256 × 256 for the CelebA dataset and 3 classes of the LSUN dataset. To have a fair comparison with the state-of-the-art models, we build the SCN generative model based on the model proposed by (Karras et al. 2017). Karras et al. suggest progressive growing of the generator and discriminator models for generating high-resolution images. Training starts at a low resolution of 4 × 4, and a new block for a higher resolution is gradually added to both the generator and the discriminator. This process continues until the networks reach the final image resolution. Each block in this model consists of an upsampling step and a convolutional layer.

For the CelebA and LSUN-bedroom datasets, we generate samples with a resolution of 128 × 128. In the generator model, we change all the convolutional layers from resolution 4 to resolution 32 to SCN convolutions, and the ReLU activation function is replaced by the squashing activation function; the higher-resolution blocks at 64 and 128 remain intact. For the LSUN-cat and LSUN-horse datasets, we generate samples of size 256 × 256. In the generator network, we replace all convolutional layers at all resolutions with SCN convolution layers followed by the squashing activation function. Table 4 presents the configuration of the subspace capsule convolution layers for all experiments. We use the tuple notation (n, c, k, k) to denote a subspace capsule convolution layer with n capsule types, c-dimensional capsule subspaces, and a receptive field of k × k.

To stabilize the training process, we adopt the Wasserstein loss with gradient penalty. We also benefit from progressive growing during training. For all of the experiments, the discriminator network is the same as the baseline architecture. We compare the generated samples quantitatively with the state-of-the-art model using the Fréchet Inception Distance (FID).
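The (n, c, k, k) notation can be made concrete with a small shape sketch: a subspace capsule convolution with n capsule types and c-dimensional subspaces produces n·c response maps, which are grouped into n capsules of dimension c at every spatial location and squashed per capsule. The NumPy sketch below is illustrative only; the convolution itself is abstracted away as a random response map, and the 16 × 16 spatial size is an assumption.

```python
import numpy as np

def squash(caps, eps=1e-8):
    """Squash each capsule vector along the last axis; lengths map into [0, 1)."""
    norm2 = np.sum(caps ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * caps / np.sqrt(norm2 + eps)

def group_into_capsules(response, n, c):
    """Group an (H, W, n*c) convolution response into (H, W, n, c) capsules."""
    H, W, ch = response.shape
    assert ch == n * c, "channel count must equal n * c"
    return squash(response.reshape(H, W, n, c))

# An (8, 64, 3, 3) layer from Table 4: 8 capsule types, 64-d subspaces,
# i.e. a 3x3 convolution with 8 * 64 = 512 output channels.
rng = np.random.default_rng(1)
response = rng.standard_normal((16, 16, 8 * 64))  # stand-in for the conv output
caps = group_into_capsules(response, n=8, c=64)
lengths = np.linalg.norm(caps, axis=-1)           # per-location capsule lengths
```

The bookkeeping makes clear why SCN adds no test-time overhead relative to a CNN: the layer is an ordinary convolution plus a reshape and a per-capsule nonlinearity.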
We believe the FID metric is the closest to human judgment, since it compares the statistics of features, extracted using the Inception model, of generated samples against those of real samples. Numerical values of this metric for all datasets are presented in Table 3. In all 4 datasets, SCN consistently improves the relative FID score of the generated samples by at least . Figure 3 shows generated samples for these datasets.
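For reference, FID is the Fréchet distance between two Gaussians fitted to the Inception features of real and generated samples. A minimal NumPy sketch, assuming the feature means and covariances have already been estimated:

```python
import numpy as np

def _psd_sqrt(m):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}).

    The symmetrized form (S1^{1/2} S2 S1^{1/2})^{1/2} has the same trace as
    (S1 S2)^{1/2} but is PSD, so a plain symmetric eigendecomposition suffices.
    """
    s1h = _psd_sqrt(sigma1)
    covmean = _psd_sqrt(s1h @ sigma2 @ s1h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Two sanity checks follow directly from the formula: identical feature distributions give an FID of 0, and two unit-covariance Gaussians whose means differ by d give exactly ||d||^2.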
Interpolation of Latent Representations: To verify that the SCN generator does not merely memorize training samples, we also walk through the manifold space. To this end, we

Figure 4: In each row, the first and last samples (in the red boxes) are generated using two independent noise vectors. The intermediate samples are generated by walking along the linear interpolant of those two noise vectors.
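The latent-space walk used to produce Figure 4 amounts to sampling points on the segment between two noise vectors. A minimal sketch, where the `generator` call in the final comment is a hypothetical stand-in for the trained SCN generator:

```python
import numpy as np

def linear_interpolant(z1, z2, steps=8):
    """Return `steps` latent vectors evenly spaced on the segment from z1 to z2."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * z1 + a * z2 for a in alphas])

rng = np.random.default_rng(2)
z1, z2 = rng.standard_normal(128), rng.standard_normal(128)
zs = linear_interpolant(z1, z2, steps=8)
# images = [generator(z) for z in zs]   # hypothetical generator forward pass
```

Smooth, meaningful transitions along these interpolants indicate that the generator has learned a continuous latent manifold rather than memorized the training set.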
LR    CelebA, bedroom (FR=128)    cat, horse (FR=256)
4     (4, 128, 3, 3)              (8, 64, 3, 3)
8     (4, 128, 3, 3)              (8, 64, 3, 3)
16    (4, 64, 3, 3)               (8, 64, 3, 3)
32    (2, 64, 3, 3)               (8, 64, 3, 3)
64    -                           (8, 32, 3, 3)
128   -                           (4, 32, 3, 3)
256   -                           (2, 32, 3, 3)
Table 4: Configuration of subspace capsule convolution layers for the generator networks. "LR" and "FR" stand for the layer resolution and the final image resolution, respectively.

choose two random latent representations z1 and z2, then use the SCN generator to generate samples for the z's on the linear interpolant of z1 and z2. Figure 4 shows the interpolated samples for the LSUN-horse and LSUN-cat datasets. As can be seen, the transitions between pairs of latent representations are smooth and meaningful.

Ablation Study
In this section, we analyze the effect of the size and the position of subspace capsules in the network on performance. Table 5 reports the semi-supervised classification error rate on the SVHN dataset with 1000 labeled training samples when we update the last fully connected or convolution layers with capsules of various sizes. Configuration 0 gives the result of the baseline model (Salimans et al. 2016); the next three rows correspond to the settings where subspace capsules are applied only to the last layer, with capsule sizes of 2, 4, and 8. Configurations 4-6 correspond to the settings where the last 3 convolution layers of the discriminator are replaced with subspace capsule convolution layers. We draw the following conclusions from this analysis: (1) subspace capsule layers are effective in improving the overall performance even when used in only one layer of the discriminator network; (2) the proper combination of the number of capsule types and the capsule dimension plays a key role in achieving the best performance.

config                    SC-Fc    SC-Conv    Error rate
0 (Salimans et al. 2016)  -        -          8.11
1                         (10,2)   -          5.8
2                         (10,4)   -          5.12
3                         (10,8)   -          5.2
4                         -        (64,2)     5.26
5                         -        (32,4)     5.49
6                         -        (16,8)     5.37
7                         (10,4)   (64,2)     4.58
Table 5: Error rate of semi-supervised classification on the SVHN dataset with 1000 labeled samples, for various sizes and types of subspace capsules. SC-Fc stands for a subspace capsule fully connected layer and SC-Conv for a subspace capsule convolution layer. In a tuple (n, c), n is the number of capsule types and c is the subspace capsule dimension.
Conclusion

In this paper, we proposed SubSpace Capsule Networks, referred to as SCNs, which offer a general capsule model with no computational overhead compared to CNNs. SCN learns a group of capsule subspaces to model the variations in the properties of an entity through the sequence of layers. We successfully applied SCN in the GAN framework, on both the generator and discriminator networks, leading to state-of-the-art performance in semi-supervised classification on CIFAR10 and SVHN and significantly improving the quality of generated samples.
Acknowledgments

This research is based upon work supported in part by the National Science Foundation under Grant No. 1741431 and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views, findings, opinions, and conclusions or recommendations contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. We also would like to thank Dr. Jun Wang for generously providing us access to the CASS GPU cluster, supported in part by the US Army/DURIP program W911NF-17-1-0208.
References

[Arjovsky, Chintala, and Bottou 2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[Bahadori 2018] Bahadori, M. T. 2018. Spectral capsule networks.

[Bass et al. 2019] Bass, C.; Dai, T.; Billot, B.; Arulkumaran, K.; Creswell, A.; Clopath, C.; De Paola, V.; and Bharath, A. A. 2019. Image synthesis with a convolutional capsule generative adversarial network. In International Conference on Medical Imaging with Deep Learning, 39–62.

[Brock, Donahue, and Simonyan 2018] Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

[Denman and Beavers Jr 1976] Denman, E. D., and Beavers Jr, A. N. 1976. The matrix sign function and computations in systems. Applied Mathematics and Computation.

[Dong et al. 2018] Dong, H.-W.; Hsiao, W.-Y.; Yang, L.-C.; and Yang, Y.-H. 2018. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Duarte, Rawat, and Shah 2018] Duarte, K.; Rawat, Y.; and Shah, M. 2018. VideoCapsuleNet: A simplified network for action detection. In Advances in Neural Information Processing Systems, 7621–7630.

[Dumoulin et al. 2016] Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; and Courville, A. 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.

[Edraki and Qi 2018] Edraki, M., and Qi, G.-J. 2018. Generalized loss-sensitive adversarial learning with manifold margins. In Proceedings of the European Conference on Computer Vision (ECCV), 87–102.

[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.

[Gulrajani et al. 2017] Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 5767–5777.

[Han et al. 2017] Han, X.; Wu, Z.; Jiang, Y.-G.; and Davis, L. S. 2017. Learning fashion compatibility with bidirectional LSTMs. In Proceedings of the 25th ACM International Conference on Multimedia, 1078–1086. ACM.

[Higham 1997] Higham, N. J. 1997. Stable iterations for the matrix square root. Numerical Algorithms.

[Hinton, Krizhevsky, and Wang 2011] Hinton, G. E.; Krizhevsky, A.; and Wang, S. D. 2011. Transforming auto-encoders. In International Conference on Artificial Neural Networks, 44–51. Springer.

[Hinton, Sabour, and Frosst 2018] Hinton, G. E.; Sabour, S.; and Frosst, N. 2018. Matrix capsules with EM routing.

[Jaiswal et al. 2018] Jaiswal, A.; AbdAlmageed, W.; Wu, Y.; and Natarajan, P. 2018. CapsuleGAN: Generative adversarial capsule network. In European Conference on Computer Vision, 526–535. Springer.

[Karras et al. 2017] Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

[LaLonde and Bagci 2018] LaLonde, R., and Bagci, U. 2018. Capsules for object segmentation. arXiv preprint arXiv:1804.04241.

[Liu et al. 2015] Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).

[Mazaheri and Shah 2018] Mazaheri, A., and Shah, M. 2018. Visual text correction. In Proceedings of the European Conference on Computer Vision (ECCV), 155–171.

[Miyato et al. 2018a] Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018a. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

[Miyato et al. 2018b] Miyato, T.; Maeda, S.-i.; Ishii, S.; and Koyama, M. 2018b. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[Netzer et al. 2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning.

[Odena, Olah, and Shlens 2017] Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2642–2651. JMLR.org.

[Sabour, Frosst, and Hinton 2017] Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 3856–3866.

[Salimans et al. 2016] Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, 2234–2242. Curran Associates, Inc.

[Singh et al. 2019] Singh, M.; Nagpal, S.; Singh, R.; and Vatsa, M. 2019. Dual directed capsule network for very low resolution image recognition. In Proceedings of the IEEE International Conference on Computer Vision, 340–349.

[Zhang, Edraki, and Qi 2018] Zhang, L.; Edraki, M.; and Qi, G.-J. 2018. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. In Advances in Neural Information Processing Systems, 5819–5828.

[Zhao et al. 2018] Zhao, W.; Ye, J.; Yang, M.; Lei, Z.; Zhang, S.; and Zhao, Z. 2018. Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538.