Self-Attention Capsule Networks for Object Classification
Assaf Hoogi, Brian Wilcox*, Yachee Gupta*, Daniel L. Rubin
Dept. of Biomedical Data Science, Stanford University
Dept. of Computer Science and Applied Math, The Weizmann Institute of Science
Dept. of Electrical Engineering, Stanford University
* Equal contributors
Abstract
We propose a novel architecture for object classification, called Self-Attention Capsule Networks (SACN). SACN is the first model that incorporates the Self-Attention mechanism as an integral layer within the Capsule Network (CapsNet). While the Self-Attention mechanism supplies long-range dependencies and selects the more dominant image regions to focus on, the CapsNet analyzes the relevant features and their spatial correlations inside these regions only. Features are extracted in the convolutional layer. The Self-Attention layer then learns to suppress irrelevant regions based on feature analysis and highlights salient features useful for a specific task. The attention map is fed into the CapsNet primary layer, which is followed by a classification layer. The proposed SACN model was designed to solve two main limitations of the baseline CapsNet: analysis of complex data and significant computational load. In this work, we use a shallow CapsNet architecture and compensate for the absence of a deeper network by using the Self-Attention module to significantly improve the results. The proposed Self-Attention CapsNet architecture was extensively evaluated on six different datasets, mainly on three different medical sets, in addition to the natural MNIST, SVHN and CIFAR10. The model was able to classify images and their patches with diverse and complex backgrounds better than the baseline CapsNet. As a result, the proposed Self-Attention CapsNet significantly improved classification performance within and across different datasets and outperformed the baseline CapsNet, ResNet-18 and DenseNet-40 not only in classification accuracy but also in robustness.
1. Introduction
Figure 1. Classification of random patch examples, CT Liver lesions (LiTS public). Each pair of images (in the same row) represents the original image with the radiologist's lesion annotation (green) and the processed image. Red: classified lesion patches; Yellow: classified normal patches. The dataset contains difficult cases such as low-contrast and highly heterogeneous lesions.

Object classification is a very challenging task, mostly because of the significant intra-class and inter-class variability, arising from different image acquisition conditions, rigid and non-rigid deformations, occlusions and corruptions. Handcrafted low-level features were proposed to handle these challenges, while unsupervised learning approaches are regularly developed to avoid the limitations of handcrafted features, such as being user dependent. Recent advances in computer vision highlight the capabilities of deep learning approaches to solve these challenges, achieving state-of-the-art performance in many classification tasks. The main reason for the success of deep learning is the ability of Convolutional Neural Networks to learn a hierarchical representation of the input data. AlexNet, presented by Krizhevsky et al. [7], was one of the first and simplest architectures for object/image classification. Later on, the deeper VGG16 model was introduced, dealing with nonlinear transformations [17]. ResNet-18 was then developed to solve a common problem in deep learning: an increasing test error rate as the architecture depth grows [4]. DenseNet-40 [5] was developed more recently and is currently considered a state-of-the-art method for object/image classification tasks. It is similar to the ResNet-18 architecture, with the difference being densely connected feature maps in the final layer of a dense block instead of a residual block. Deep learning-based approaches have also become popular in the medical imaging domain due to increasing computational power and availability of data. However, these methods still lack robustness across different datasets and require a significant amount of annotated data. These limitations are even more substantial in the medical domain (relative to the natural domain) because the annotated data is highly heterogeneous and relatively small in size. To tackle these challenges, the U-Net architecture was developed. It includes skip connections and was designed for medical images, wherein these additional connections can extract larger amounts of information from the limited data size [15] [22]. CapsNet is one of the most recent architectures, developed by Hinton's group for object classification [16]. It is powerful and was designed to deal with small datasets, as is typical for the medical domain. It learns the spatial correlations between objects and can recognize multiple objects in the image even if they overlap. However, CapsNet does not learn the important local features. Therefore, a new architecture that helps solve the mentioned limitations is highly desired and can help advance the field of object classification, especially in the medical domain. This paper presents a significant improvement of the Capsule Networks architecture and has several key contributions.
• We introduce a novel architecture, called Self-Attention Capsule Networks (SACN). The architecture includes an integral Self-Attention layer that lies between the convolutional and the primary CapsNet layers. This allows the model parameters, even in shallower layers, to be updated mostly based on image regions that are more relevant to a given task. The attention mechanism, which is used as a non-local operation, solves the task of learning the important features, while the CapsNet considers the positional/rotational spatial relations between these features. Therefore, integrating the Self-Attention module within the CapsNet architecture has an important complementary role.

• The proposed architecture is designed to work well under the constraint of limited computational resources. While the baseline CapsNet [16] is considered an expensive architecture in terms of computational cost, the proposed model uses a shallow CapsNet architecture and compensates for the absence of a deeper network by using the Self-Attention module to significantly improve the feature extraction.
Our proposed SACN enjoys both worlds: it keeps computational efficiency while having a large receptive field at the same time, obtaining long-range dependencies to improve performance.

• We are not familiar with other works that were tested on both the medical and natural image domains, as these domains have substantially different image characteristics. Here we conducted extensive experiments to show the generalization of our model. We show that the proposed model supplies more accurate, robust and stable object classification within and across different datasets. Moreover, these datasets contain complex lesions, such as low-contrast lesions or lesions with high heterogeneity.
We were able to show that our proposed SACN model performs better than the baseline CapsNet on complex data, the weakest link of the baseline method.
2. Related work
Recently developed Capsule Networks represent a breakthrough in the field of neural networks. The CapsNet architecture contains three types of layers: the convolutional layer, the primary capsule layer and the classification (digit) capsule layer [16]. Capsule networks are powerful because of two key ideas that distinguish them from traditional CNNs: 1) dynamic routing-by-agreement instead of max pooling, and 2) squashing, where the scalar output feature detectors of CNNs are replaced with vector output capsules. Routing-by-agreement means that it is possible to selectively choose to which parent in the layer above a capsule's output is sent. For each optional parent, the capsule network can increase or decrease the connection strength. As a result, the CapsNet can keep the spatial correlations between objects within the image [16]. Squashing means that instead of having individual neurons sent through non-linearities as is done in CNNs, capsule networks have their output squashed as an entire vector. The squashing function enables a better representation of the probability that an object is present in the input image.
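To make routing-by-agreement and squashing concrete, the following is a minimal sketch of both operations, assuming PyTorch; the tensor layout, the three routing iterations and the function names (squash, route_by_agreement) are illustrative choices, not the exact implementation of [16] or of this paper.

    import torch
    import torch.nn.functional as F

    def squash(s, dim=-1, eps=1e-8):
        # Shrink short vectors toward length 0, cap long vectors just below 1.
        sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
        scale = sq_norm / (1.0 + sq_norm)
        return scale * s / torch.sqrt(sq_norm + eps)

    def route_by_agreement(u_hat, num_iters=3):
        # u_hat: prediction vectors, shape (batch, in_caps, out_caps, out_dim).
        b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # log priors b_ij
        for _ in range(num_iters):
            c = F.softmax(b, dim=2)                # coupling coefficients c_ij
            s = (c.unsqueeze(-1) * u_hat).sum(1)   # weighted sum over input capsules
            v = squash(s)                          # squashed parent outputs v_j
            # Agreement: strengthen b_ij when a prediction aligns with its parent.
            b = b + (u_hat * v.unsqueeze(1)).sum(-1)
        return v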
This better representation, in addition to the fact that no max-pooling is applied (i.e., minimizing the information loss), enables CapsNet to handle small and sparse sets of images, as is typical for medical imaging. Afshar et al. [1] incorporated CapsNets for brain tumor classification. The authors investigated the overfitting problem of CapsNets based on a real set of MRI images. Their results show that CapsNet can successfully outperform CNNs for the brain tumor classification problem. However, CapsNet is an expensive architecture in terms of computational and memory loads.
As a result, the commonly used CapsNets are relatively shallow architectures, which were shown to perform well mainly on simple datasets; they did not perform well on more complex data. Deliège et al. [2] introduced HitNet, a deep learning network characterized by the use of a Hit-or-Miss layer composed of capsules. The idea is that the capsule corresponding to the true class has to make a hit in its target space, while the other capsules have to make misses. The method converged faster than CapsNet, but their results were not able to outperform CapsNet on complex datasets. In [20], the authors explored the effect of a variety of CapsNet modifications, ranging from stacking more capsule layers to trying out different parameters such as increasing the number of primary capsules or customizing an activation function. However, the best validation accuracy their architecture reached on a relatively complex dataset was 71.55%, only comparable to CapsNet performance on the same dataset. They mentioned that computational resources limited their performance. Another architecture, Diverse Capsule Networks, introduced in [12], supplied only a 0.31% improvement over the baseline CapsNet accuracy.
The Self-Attention mechanism can help the model focus on more relevant regions inside the image and gain better performance for classification tasks with fewer data samples [13] or more complex image backgrounds. The attention mechanism allows models to learn deeper correlations between objects [9] and helps discover interesting new patterns within the data [6] [11]. Additionally, it helps in modeling long-range, multi-level dependencies across different image regions. Wang et al. [19] address the specific problem of CNNs processing information too locally by introducing a Self-Attention mechanism, where the output of each activation is modulated by a subset of other activations. This helps the CNN consider smaller parts of the image if necessary. Larochelle and Hinton [8] proposed using Boltzmann Machines that choose where to look next to find the locations of the most informative intra-class objects, even if they are far away in the image. Reichert et al. proposed a hierarchical model to show that certain aspects of attention can be modeled by Deep Boltzmann Machines [14]. Attention-based models were also proposed for generative models. In [18], the authors introduce a framework to infer the region of interest for generative models of faces. Their framework is able to pass only the relevant information through the generative model. A recent technique that focuses on generative adversarial models is SAGAN [21]: the authors proposed Self-Attention Generative Adversarial Networks that achieve state-of-the-art generative results on the ImageNet dataset. Recent work that deals specifically with medical data can be found in [10]. This work presents an attention mechanism that is incorporated in the U-Net architecture for tissue/organ identification and localization. However, U-Net was mainly developed for segmentation tasks in the medical domain, rather than for object classification. Our SACN model plays a key role in advancing medical imaging, as most classification tasks in this domain need positional relationships between features to perform optimally. By using our architecture, we can focus the attention on relevant locations in the input and analyze the spatial relationships between their features by taking advantage of the CapsNet structure.
3. The proposed model
Figure 2. Our proposed SACN architecture.

Our proposed model is illustrated in Figure 2. Let x ∈ R^{C×N} be the output feature matrix extracted from the initial convolutional layer of the CapsNet, which is then fed into a Self-Attention module. Let f(x), g(x) and h(x) be three feature extractors. h(x) has the same number of channels (C) as the input; we found that 512 channels supplies the best accuracy. f(x) and g(x) are position modules that are used to calculate attention. f(x_i) and g(x_j) take input feature maps at the i-th and j-th positions. f(x_i) and g(x_j) both have a reduced number of channels (C/8) compared with h(x). This allows us to filter out noisy input channels and care only about the features that are relevant to the attention mechanism. According to the dominant features, we pass only relevant activations through the Primary Capsule layer. Inside the attention module we use 2-D non-strided 1 × 1 convolutions:

    η_{ij}(x) = f(x_i)^T g(x_j)    (1)

where f(x_i) = W_f x_i, g(x_i) = W_g x_i, and W_f ∈ R^{(C/8)×C} and W_g ∈ R^{(C/8)×C} are learned weight matrices. We then compute the softmax of η_{ij} to get an output attention map β_{ij},

    β_{ij} = exp(η_{ij}) / Σ_{i=1}^{N} exp(η_{ij})    (2)

To obtain the final Self-Attention map, o ∈ R^{C×N}, which will be the input of the primary CapsNet capsule, we apply matrix multiplication between the attention map β_{ij} and h(x_i),

    o_j = Σ_{i=1}^{N} β_{ij} h(x_i)    (3)

where h(x_i) = W_h x_i is the third input feature channel with C channels (see Figure 2) and, similarly to W_f and W_g, W_h is also a learned weight matrix. By virtue of this matrix multiplication step, the Self-Attention mechanism applies a weighted sum over all derived features of h(x_i) and filters out the ones that have the least effect. Therefore, the final output of the Self-Attention layer is

    y_i = α o_i + x_i    (4)

In our model, α is initialized to 0. As a result, the model can explore the local spatial information first, before automatically refining it with the Self-Attention and analyzing higher data complexity by considering further regions in the image. The network then gradually learns to assign higher weight to the non-local regions. By initializing α to 0 and requiring no other pre-defined parameters, we do not depend on user input, contrary to common attention mechanisms.

The final output of the Self-Attention module, y_i, is then fed into the CapsNet primary layer. Let v_j be the output vector of capsule j. The length of this vector, which represents the probability of whether or not a specific object is located at that given location in the image, should be between 0 and 1. To ensure that, we apply a squashing function that keeps the positional information of the object. Short vectors are shrunk to almost 0 length and long vectors are brought to a length slightly below 1. With s_j = Σ_i c_{ij} W_{ij} y_i, the squashing function is defined as

    v_j = (||s_j||² / (1 + ||s_j||²)) · (s_j / ||s_j||)    (5)

where W_{ij} is a weight matrix and c_{ij} are the coupling coefficients between capsule i and all the capsules in the layer above, j, determined by the iterative dynamic routing process

    c_{ij} = exp(b_{ij}) / Σ_j exp(b_{ij})    (6)

b_{ij} are the log prior probabilities that the i-th capsule should be coupled to the j-th capsule. To obtain a reconstructed image during training, we use the vector v_j that supplies the highest coupling coefficient c_{ij}. Then, we feed the chosen v_j through two fully connected ReLU layers.
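For illustration, the Self-Attention layer of Eqs. (1)-(4) could be written as follows in PyTorch. This is a sketch only: it assumes the 1 × 1 convolutions above realize W_f, W_g and W_h with the C/8 channel reduction, and the class name and tensor shapes are our own choices, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Conv2d(channels, channels // 8, kernel_size=1)  # W_f
            self.g = nn.Conv2d(channels, channels // 8, kernel_size=1)  # W_g
            self.h = nn.Conv2d(channels, channels, kernel_size=1)       # W_h
            self.alpha = nn.Parameter(torch.zeros(1))  # initialized to 0, Eq. (4)

        def forward(self, x):
            B, C, H, W = x.shape
            N = H * W
            f = self.f(x).view(B, -1, N)             # (B, C/8, N)
            g = self.g(x).view(B, -1, N)             # (B, C/8, N)
            h = self.h(x).view(B, C, N)              # (B, C, N)
            eta = torch.bmm(f.transpose(1, 2), g)    # η_ij = f(x_i)^T g(x_j), Eq. (1)
            beta = F.softmax(eta, dim=1)             # normalize over i, Eq. (2)
            o = torch.bmm(h, beta).view(B, C, H, W)  # o_j = Σ_i β_ij h(x_i), Eq. (3)
            return self.alpha * o + x                # y_i = α o_i + x_i, Eq. (4)

Because alpha starts at zero, the module initially behaves as an identity mapping and the non-local term is blended in only as training assigns it weight.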
The reconstruction loss L_R(I, Î) of the architecture is defined as

    L_R(I, Î) = ||I − Î||²    (7)

where I is the original input image and Î is the reconstructed image. L_R(I, Î) is used as a regularizer that takes the output of the chosen v_j and learns to reconstruct an image, with the loss function being the sum of squared differences between the outputs of the logistic units and the pixel intensities (L2 norm). This forces capsules to learn features that are useful for the reconstruction procedure, which inherently allows the model to learn features at near-pixel precision. Therefore, the better the reconstruction, the better the prediction. The reconstruction loss is then added to the following margin loss function, L_M,

    L_M = Σ_k T_k max(0, m⁺ − ||v_k||)² + Σ_k λ (1 − T_k) max(0, ||v_k|| − m⁻)²    (8)

where T_k = 1 whenever class k is present, and m⁺ and m⁻ are the upper and lower margins. The total loss over all classes k, L_T, is

    L_T = L_M + ξ · I_size · L_R    (9)

where ξ down-weights the reconstruction loss so that it does not dominate L_M during training, and I_size = H · W · C is the number of input values, based on the height, width and number of channels of the input.
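A minimal sketch of the three loss terms, Eqs. (7)-(9), assuming PyTorch. The margin values m⁺ = 0.9 and m⁻ = 0.1 and the scaling factor ξ = 0.0005 are the common CapsNet defaults and are placeholders here, since the paper's exact values are not fully recoverable from the text; λ = 0.5 matches the down-weighting value reported below.

    import torch

    def reconstruction_loss(img, recon):
        # L_R(I, Î) = ||I − Î||², Eq. (7): sum of squared pixel differences.
        return ((img - recon) ** 2).sum()

    def margin_loss(v_lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
        # L_M, Eq. (8). v_lengths: (batch, num_classes) capsule lengths ||v_k||;
        # targets: one-hot (batch, num_classes) with T_k = 1 for the true class.
        present = targets * torch.clamp(m_plus - v_lengths, min=0) ** 2
        absent = lam * (1 - targets) * torch.clamp(v_lengths - m_minus, min=0) ** 2
        return (present + absent).sum(dim=1).mean()

    def total_loss(l_margin, l_recon, input_size, xi=0.0005):
        # L_T = L_M + ξ · I_size · L_R, Eq. (9); ξ keeps L_R from dominating L_M.
        return l_margin + xi * input_size * l_recon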
The input to the SACN architecture depends on the domain, the task and the data complexity. One of the main limitations of CapsNet concerns the analysis of complex data; previous CapsNet research defined complex data as data with significant background heterogeneity. For the medical domain we considered a patch-wise CapsNet, similar to [23]. Patch-wise analysis was chosen because of the desired task, local tissue classification. Moreover, by using a patch-wise methodology we could also reduce the data complexity to some extent. A patch size of 16 × 16 pixels was chosen, as it supplied the best classification results. A value of 0.5 was chosen for the λ down-weighting of the loss, together with a weight variance of 0.15. Since the CapsNet is a relatively expensive architecture in terms of computational load, we designed our architecture to work well under the constraint of limited computational resources and boost the performance by adding the Self-Attention module. Therefore, our CapsNet architecture contains one convolutional layer with 5 × 5 kernels. Thirty epochs were used because this was sufficient to train the small dataset. Algorithm 1 describes the training process of the proposed model.

Data: (I, G): pairs of image I and ground truth G
Result: Y_out: final instance classification
while not converging do
  CapsNet Convolutional Layer: features are extracted and divided into three output feature vectors (f(x), g(x), h(x)).
  Attention Layer: the Self-Attention map y_i is generated based on the feature vectors, the attention map β_ij, the learned weight matrix W_h and a specific image location x_i.
  Primary and Classification Layers: the dominant features are fed into the Primary CapsNet layer and from there to the Classification layer. The output classification Y_out is obtained.
  Calculate the Attention-based CapsNet loss: L_T ← Loss(G, Y_out).
  Back-propagate the loss and compute ∂L/∂W.
  Update the weights: the matrices W are updated for both the Self-Attention layer and the CapsNet architecture.
end
Algorithm 1: Our SACN training process
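The loop in Algorithm 1 could be realized roughly as follows; this is a sketch assuming hypothetical module names (conv_layer, attention, primary_caps, class_caps, decoder) standing in for the CapsNet layers, plus the loss helpers from the previous sketch, not the authors' actual code.

    import torch

    def train_step(model, optimizer, images, targets):
        optimizer.zero_grad()
        x = model.conv_layer(images)        # CapsNet convolutional layer
        y = model.attention(x)              # Self-Attention output y_i
        caps = model.primary_caps(y)        # primary capsule layer
        v = model.class_caps(caps)          # classification capsules
        v_lengths = v.norm(dim=-1)          # ||v_k||: per-class probabilities
        recon = model.decoder(v, targets)   # reconstruct from the chosen v_j
        l_m = margin_loss(v_lengths, targets)
        l_r = reconstruction_loss(images, recon)
        loss = total_loss(l_m, l_r, images[0].numel())  # I_size = H*W*C
        loss.backward()                     # back-propagate ∂L/∂W
        optimizer.step()                    # update attention and CapsNet weights
        return loss.item()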
4. Experiments
We conducted extensive experiments on highly diverse medical data, and present initial results for natural data as well (as described in the "Natural datasets" subsection). The medical dataset is composed of three separate subsets of images, each containing cancer lesions located in different body organs and screened by different imaging modalities. Two subsets were collected by radiologists at Stanford hospital (250 CT Lung images and 369 MR Brain tumors) and the third set, 1102 CT Liver lesions, is a public one (LiTS). In addition to the differences in the organs and the imaging modalities, these datasets also differ in other acquisition criteria: 1) their spatial resolution, starting at 0.78 mm/pixel, and 2) their slice thickness, ranging from 2.5 mm to 5 mm. These differences affect the appearance of the cancer lesions, characterized by different noise levels, homogeneity or contrast relative to the surrounding normal tissue. Each subset has its major challenges, but the CT Lung dataset is considered the most difficult for patch classification due to low-contrast lesions and their similarity to the lung blood vessels, while CT Liver is the easiest one. The inter- and intra-variability between sets of images is shown in Figure 1 and in Figure 3. An external expert annotated two separate regions in each image: normal tissue and cancer lesion. Thirty patches were extracted from each region, meaning that each training image supplied 60 samples to the whole training cohort. For all experiments, we used 80% of the dataset for training, 10% for testing and 10% for validation.

The performance of our method was measured as patch-wise classification into normal or lesion patches. To evaluate the capabilities of our proposed method, we compared the developed architecture with 1) the baseline CapsNet that this work mainly aims to improve, and 2) the state-of-the-art ResNet-18 and DenseNet-40 architectures. The DenseNet-40 and ResNet-18 were adjusted to analyze small image patches. Similar to what was done when applying ResNet to CIFAR10 images, we removed some max-pooling layers to ensure that information is not lost, preventing the receptive field from shrinking too quickly. We did not want to up-sample the patches too much because it can add noise, which limits the overall performance. Therefore, we carefully considered the up-sampling/max-pooling trade-off and chose the setting that supplied the best performance. We evaluated the effectiveness of these methods by calculating several statistics; statistical significance between the methods was assessed using the Wilcoxon paired test (a sketch of this protocol is given below).
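As a rough illustration of this evaluation protocol, the following sketch performs the 80/10/10 split and the Wilcoxon paired test, assuming scikit-learn and SciPy are available; the arrays acc_sacn and acc_baseline are hypothetical per-image accuracies, not the study's data.

    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.model_selection import train_test_split

    images = np.arange(1000)  # placeholder image indices
    train, rest = train_test_split(images, test_size=0.2, random_state=0)
    val, test = train_test_split(rest, test_size=0.5, random_state=0)  # 10% / 10%

    acc_sacn = np.random.rand(100)      # per-image accuracies (hypothetical)
    acc_baseline = np.random.rand(100)
    stat, p = wilcoxon(acc_sacn, acc_baseline)  # paired significance test
    print(f"Wilcoxon p-value: {p:.4f}")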
5. Results
Figure 3 shows the classification results of a subset of randomly chosen patches. For the purpose of visualization, only the colored patches have been classified into normal/lesion regions. The figure shows the substantial diversity of the image characteristics, within and across subsets (CT Lung, CT Liver and MR Brain). Our method shows its ability to handle small lesions, highly heterogeneous lesions and low-contrast lesions. It can also distinguish very well between normal structures within the tissue (e.g., blood vessels in CT Lung, normal structures in the MR Brain image) and cancer lesions. All these challenges, which usually defeat common techniques, are handled well by our proposed method.
Table 1 presents the classification accuracy over all patches in the testing set (in contrast to the subset of patches visualized in Figure 3). We set the optimal parameters for every architecture we compared our SACN with, to ensure that any difference between the performances is directly related to the novelty of our architecture. Table 1 shows that our method outperforms all other methods for each subset that has been analyzed and for each of the following criteria.
First, the classification accuracy within each specific subset (Liver, Lung, Brain) is consistently higher when using our proposed method.
Second, the standard deviation (std) of the classification accuracy over different images within the same subset is lower than the equivalent values when using the baseline CapsNet, DenseNet-40 and ResNet-18.
Third, the robustness and the stability across different subsets are also significantly higher when using our model. It is worth mentioning that the performance difference between our technique and the methods we compared with grows with the level of data complexity; this key result reinforces the strength of our method. For example, the difference between our method and the others is larger for CT Lung and smaller for CT Liver. To ensure that our architecture does not overfit the training data, we explored the loss/error rates of the training and validation sets for each individual tested subset (Figure 4). It can be clearly seen that the losses of the training and validation sets are comparable, having the same trend and without substantial differences between them.
We also show results for natural data, exploring the generalization of the proposed technique to domains other than the medical one. The MNIST database includes a training set of 60,000 hand-written digit examples, and a test set of 10,000 examples. The Street View House Numbers (SVHN) is a real-world image dataset. It contains 600,000 digit images that come from a significantly harder real-world problem than MNIST: the images lack any contrast normalization and contain overlapping digits and distracting features, which makes them a much more difficult classification dataset. CIFAR10 is the third natural dataset that we analyzed with our proposed SACN technique. The dataset consists of 60,000 images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. In contrast to the medical datasets, we used image-wise classification for natural images. Image-wise analysis was applied because 1) the task was classification of the whole image into a specific class, and 2) the objects and their backgrounds in the natural domain were not considered overly complex. We chose a batch size of 64 for MNIST and CIFAR10 and 32 for SVHN, a learning rate of 2e−, and 60 epochs; a weight variance of 0.01 was used. Because the natural domain was not the main focus of this work (though it was still important to show the generalizability of our model), we compared the performance of our proposed SACN with the baseline CapsNet only, showing the superiority of our methodology over the baseline. For the MNIST dataset, we obtained a classification accuracy of 0.995, which is comparable to the state-of-the-art methods and to the baseline CapsNet architecture. For SVHN, we improved the classification accuracy of the baseline CapsNet, which is already quite high, by 2.4%. Lastly, the classification accuracy on CIFAR10 was improved by 3.5% when using our SACN, compared with the baseline CapsNet.

Figure 3. Classification of selected patches. Each pair of images (in the same row) represents the original image with the radiologist's lesion annotation (green) and the processed image. Red: classified lesion patches; Yellow: classified normal patches. Upper row: CT Lung; bottom row: MR Brain. In each image, we show classification results for the example colored patches.

Figure 4. Calculated loss for the training and validation sets. Upper: MR Brain; bottom: CT Lung.

Table 1. Comparison (mean, std) of our proposed method with the baseline CapsNet, DenseNet and ResNet-18 architectures. The mean and std values were calculated over different images in each subset; statistical significance was assessed with a Wilcoxon paired test. The best results for each subset are bolded.

Dataset       | ResNet-18 [3] | DenseNet [5] | Baseline CapsNet [16] | Our SACN
Liver (LiTS)  | 0.87 ±        |     ±        |          ±            |    ±
Brain         | 0.91 ±        |     ±        |          ±            |    ±
Lung          | 0.87 ±        |     ±        |          ±            |    ±
6. Discussion and Conclusion
This paper introduces a novel architecture, called Self-Attention Capsule Networks (SACN), which was proposed to specifically improve the known CapsNet architecture.
The architecture utilizes the important key ideas of the CapsNet architecture, and boosts its performance by incorporating the Self-Attention mechanism as an integral layer within the CapsNet architecture. Our proposed architecture allows the model parameters, even in shallower layers, to be updated mostly based on image regions that are more relevant to a given task. We conducted an extensive set of experiments, focusing on the medical domain but also presenting an analysis of natural images. For the medical subsets, which form a highly diverse cohort, our proposed method significantly outperformed the baseline CapsNet. We also compared our technique with the advanced state-of-the-art architectures DenseNet-40 and ResNet-18; our method was significantly better than these architectures as well. The better performance of our model is reflected in higher accuracy and lower standard deviation. Table 1 shows a key advantage of the proposed SACN over the baseline CapsNet, ResNet-18 and DenseNet architectures: the more complex the cohort, the more dominant the strength of the proposed method becomes. This observation fits the known CapsNet limitation: the baseline tries to account for everything present in an image, and for more complex images, where the background is too diverse, it does not perform well. With regard to the public natural data that we analyzed, we were able to show classification accuracy that was comparable to or better than the CapsNet or other state-of-the-art methods reported in the literature.
Implementing the model across substantially diverse datasets and domains shows its high generalization, robustness and classification capabilities.
The baseline CapsNet is considered an expensive architecture in terms of computational load. For example, analyzing some of the datasets with the baseline CapsNet resulted in Out of Memory errors on the GPU resources. In this work, we were able to supply classification accuracy that is significantly better than the baseline CapsNet architecture by using a relatively shallow CapsNet architecture and incorporating the attention module.
We were able to supply better results with less computational load, which was reported in the literature as a cause of CapsNet process shutdowns. Our architecture is powerful and has the potential to be widely used, as it requires fewer computational resources. Future work will include additional experiments, focusing on more complex natural and medical datasets. These experiments will be conducted on 2D and 3D data, using additional computational resources.
References
[1] P. Afshar, A. Mohammadi, and K. N. Plataniotis. Brain tumor type classification via capsule networks. Oct 2018.
[2] A. Deliège, A. Cioppa, and M. Van Droogenbroeck. HitNet: a neural network with capsules embedded in a hit-or-miss layer, extended with hybrid data augmentation and ghost capsules. arXiv preprint arXiv:1806.06519, 2018.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016.
[5] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[6] Y. Jo, H. Cho, S. Y. Lee, G. Choi, G. D. Kim, H. L. Min, and Y. Park. Quantitative phase imaging and artificial intelligence: A review. IEEE Journal of Selected Topics in Quantum Electronics, 25:1-14, 2019.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.
[8] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 1243-1251, 2010.
[9] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, volume 3, 2014.
[10] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[11] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13:4700-4719, 1993.
[12] S. S. R. Phaye, A. Sikka, A. Dhall, and D. Bathula. Dense and diverse capsule networks: Making the capsules learn better. 2018.
[13] W. Qian, Z. Jiaxing, S. Sen, and Z. Zheng. Attentional neural network: Feature selection using cognitive feedback. In Advances in Neural Information Processing Systems 27, pages 2033-2041, 2014.
[14] D. P. Reichert, P. Seriès, and A. J. Storkey. A hierarchical generative model of recurrent object-based attention in the visual cortex. In ICANN, 2011.
[15] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, pages 234-241, 2015.
[16] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829, 2017.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[18] Y. Tang, N. Srivastava, and R. Salakhutdinov. Learning generative models with visual attention. arXiv preprint arXiv:1312.6110, 2013.
[19] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.
[20] E. Xi, S. Bing, and Y. Jin. Capsule network performance on complex data. CoRR, abs/1712.03480, 2017.
[21] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[22] M. Zhang, X. Li, M. Xu, and Q. Li. Image segmentation and classification for sickle cell disease using deformable U-Net. arXiv preprint arXiv:1710.08149, 2017.
[23] G. Özbulak. Image colorization by capsule networks. 2019.