Dual Directed Capsule Network for Very Low Resolution Image Recognition
Maneet Singh, Shruti Nagpal, Richa Singh, and Mayank Vatsa
IIIT-Delhi, India
{maneets, shrutin, rsingh, mayank}@iiitd.ac.in

Abstract
Very low resolution (VLR) image recognition corresponds to classifying images with resolution 16x16 or less. Though it has widespread applicability when objects are captured at a very large stand-off distance (e.g., surveillance scenarios) or from wide-angle mobile cameras, it has received limited attention. This research presents a novel Dual Directed Capsule Network model, termed DirectCapsNet, for addressing VLR digit and face recognition. The proposed architecture utilizes a combination of capsule and convolutional layers for learning an effective VLR recognition model. The architecture also incorporates two novel loss functions: (i) the proposed HR-anchor loss and (ii) the proposed targeted reconstruction loss, in order to overcome the challenge of limited information content in VLR images. The proposed losses use high resolution (HR) images as auxiliary data during training to "direct" discriminative feature learning. Multiple experiments for VLR digit classification and VLR face recognition are performed, along with comparisons with state-of-the-art algorithms. The proposed DirectCapsNet consistently showcases state-of-the-art results; for example, on the UCCS face database, it shows over 95% face recognition accuracy when 16x16 images are matched with 80x80 images.
1. Introduction
In typical surveillance scenarios, images are often captured from a large stand-off distance, thus rendering the region of interest to be of very low resolution (VLR), often less than 16x16 [33]. Figure 1(a) shows sample real-world applications of VLR recognition, where the region of interest can be a face, a suspicious object, or the license plate number of a moving vehicle. These samples demonstrate the arduous nature of the problem, where some of the key challenges of VLR recognition are the presence of limited information content and blur. VLR recognition also has applicability in image tagging, where multiple objects/people are captured in the frame and each of these entities is of small resolution.
Figure 1: (a) Real-world applications of VLR recognition: (i) digit classification, (ii) face recognition (image source: (i) Internet, (ii) UCCS dataset [24]). (b) Proposed Dual Directed Capsule Network (DirectCapsNet), contrasted with (i) traditional CapsNet-based VLR recognition; in addition to the VLR inputs, DirectCapsNet uses HR images, with HR anchors directing the VLR features.
The proposed DirectCapsNet utilizes HR samples to direct the learning of more meaningful and discriminative features for VLR image recognition via the proposed HR-anchor loss and the targeted reconstruction loss.

Netzer et al. [17] demonstrated the poor performance of humans on identifying VLR digits captured in real surroundings. For the Street View House Numbers (SVHN) dataset, the authors observed 100% human accuracy for samples with larger pixel heights; on the other hand, the performance dropped considerably when classifying very low resolution samples, i.e., images of height up to 25 pixels, thereby reaffirming the challenging nature of the problem. Direct up-sampling via interpolation could be viewed as a possible solution for VLR recognition; however, multiple studies have demonstrated poor performance owing to the required large magnification factor [14, 25] and the possible introduction of noise, which can also be observed in Figure 1(a)(i). Further, researchers have also demonstrated the inability of models trained on high resolution (HR) images (containing high information content) to perform well on (V)LR images [25]. The scarcity of available solutions and the wide applicability of VLR recognition make it an important problem demanding dedicated attention.

This research proposes a novel capsule network based model for VLR image recognition. Hinton et al. [7] proposed learning "capsules", which represent a vector of instantiation parameters in order to encode the input more efficiently. Instantiation parameters may constitute the properties of an image such as the pose, lighting, and deformation of the visual entity relative to an implicitly defined canonical version of that entity [7]. We believe that such parameters would be invariant to the resolution of the image, therefore presenting the potential of being useful for VLR recognition. Due to the limited information content in VLR images, the VLR recognition model could benefit from the information-rich HR samples as well. To this effect, we propose the Dual Directed Capsule Network (termed DirectCapsNet) (Figure 1(b)) to learn meaningful features for VLR recognition, directed (or guided) by the HR samples. The contributions of this research are as follows:

• A novel Dual Directed Capsule Network (DirectCapsNet) model is proposed for VLR recognition, which directs the features learned from the VLR images containing limited information towards the more meaningful and discriminative features of the HR images.
• Two losses are proposed for directing the VLR recognition model: (i) HR-anchor loss and (ii) targeted reconstruction loss. The HR-anchor loss is applied in the feature learning module, where it pushes the VLR features of a particular class towards a representative HR feature (anchor) of that class. The targeted reconstruction loss is utilized at the classification module, where HR images are reconstructed from the capsule outputs of the VLR images, thereby forcing the capsules of VLR and HR images of the same class to be similar.
• Experimental results and analysis demonstrate the advantages of the proposed DirectCapsNet model for VLR digit classification and VLR face recognition. Experiments are performed on the SVHN [17], CMU Multi-PIE [6], and UCCS [24] databases, and comparisons are performed with state-of-the-art algorithms. The proposed model yields over 95% accuracy on the challenging UCCS face database. On the SVHN database, it achieves about 84% classification accuracy with 8x8 VLR images, an improvement of almost 27% over the existing results.
2. Related Work
There have been several advances in the field of low resolution recognition [11, 14, 18, 31]; however, the area of very low resolution (VLR) recognition remains relatively less explored. As mentioned previously, VLR recognition refers to identifying regions of interest with 16x16 resolution or less. Owing to the limited information content in a given VLR image, a potential solution is to super-resolve or synthesize its higher resolution image [20, 28], which is then used for recognition. While there exists vast literature on super-resolution and synthesis algorithms [13, 23, 29], most of them focus primarily on the visual quality of the generated image and not on the task of recognition. Zou and Yuen [33] proposed one of the initial super-resolution techniques with a specific focus on VLR face recognition. Their algorithm utilizes a combination of a visual-quality constraint for good quality HR synthesis and a discriminative constraint for learning features useful for recognition. Singh et al. [25] proposed an identity-aware face synthesis technique for generating an HR image from a given LR input; the synthesized images were provided to a Commercial-Off-The-Shelf (COTS) system for recognition.

Apart from super-resolution based techniques, researchers have also proposed algorithms for enhancing or improving the features learned for VLR images by using information extracted from HR images. For instance, Bhatt et al. [2] proposed an ensemble-based co-transfer learning algorithm for face recognition; the co-transfer algorithm operates at the intersection of co-training and transfer learning by utilizing the information of HR images for enhancing VLR classification. Wang et al. [30] proposed Robust Partially Coupled Networks for VLR recognition, where HR images are used as "auxiliary" data during training for learning discriminative information which might not be available in VLR images. As demonstrated via multiple experiments, using HR images at training time enhances the learned VLR features, resulting in improved recognition performance. Mudunuri and Biswas [16] proposed a reference-based approach along with multidimensional scaling for learning a common space for HR and VLR images. Recently, Li et al. [14] analyzed different metric learning techniques for LR and VLR face recognition, by learning a common feature space for HR and LR samples. Ge et al. [4] proposed a selective knowledge distillation technique for (V)LR face recognition: a base network trained on HR face images is used for selecting the most informative facial features for a (V)LR CNN model, in order to enhance the (V)LR features and the classification performance.

In the literature, VLR recognition algorithms have been shown to benefit from HR samples by learning shared representations between the HR and VLR samples [30] or by transferring the model information learned from the HR data onto the VLR recognition model [4]. By utilizing the additional information from the HR images at training time, such algorithms learn more discriminative and meaningful features than those learned independently from the VLR images. This research proposes to utilize the auxiliary HR samples during training to direct the VLR features towards the more informative HR features, via a novel DirectCapsNet model.

Figure 2: Sample HR and VLR images from the (a) SVHN dataset, (b) CMU Multi-PIE dataset, and (c) UCCS dataset. The HR images (first row) contain high information content, which is often missing in the VLR samples (second row).
3. Proposed Dual Directed Capsule Network
As shown in Figure 2, the problem of very low resolution (VLR) recognition suffers from the challenge of limited information content in the input images, which often results in a lack of discriminative features useful for recognition/classification. In order to overcome this challenge, we propose a novel dual directed capsule network, termed DirectCapsNet. DirectCapsNet enhances the VLR representations by directing them in two ways: via the proposed (i) HR-anchor loss and (ii) targeted reconstruction loss, both of which provide additional supervision using the HR images. The HR information is used to direct/guide the framework to extract discriminative representations even from VLR images having limited information content. This is accomplished by using the HR-anchor loss, which brings the representations of VLR images closer to the representations of their corresponding HR samples. This is also enforced at the classification stage via the targeted reconstruction loss, which promotes similar features for HR and VLR samples of the same class. Since the base architecture of the proposed model is a capsule network, we first briefly explain its functioning, followed by an in-depth explanation of the proposed model.
Hinton et al. [7] proposed the concept of capsules as an effective method of learning representations. It was further developed by Sabour et al. [22], where a capsule network (CapsNet) is presented for classification. A capsule is a "group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part". In other words, instead of a single scalar output, each capsule outputs a vector, the values of which are referred to as the activity vector. The length of each capsule vector ($\|\cdot\|$) is bounded in the range $[0, 1)$. Sabour et al. [22] proposed the concept of dynamic routing between capsules, wherein multiple layers of capsules are stacked for object classification. The final layer contains the classification capsules of dimension $k \times m$, where $k$ is the number of classes and $m$ is the capsule dimension. For a given input, the predicted class is the class corresponding to the capsule with the maximum activity vector (length). In order to learn an effective classification model, the margin loss is used to train the network. Given a $K$-class problem, with $v_k^{x_c}$ as the output of the $k$-th class capsule for an input $x_c$ (belonging to class $c$), and $T_k$ being the label corresponding to the $k$-th class, the margin loss of CapsNet is defined as:

$$L_{Margin} = \sum_{k=1}^{K} \big( T_k \max(0, m^+ - \|v_k^{x_c}\|)^2 + \lambda (1 - T_k) \max(0, \|v_k^{x_c}\| - m^-)^2 \big) \quad (1)$$

where $T_k \in \{0, 1\}$ denotes whether the input sample belongs to class $k$ ($T_k = 1$) or not ($T_k = 0$). $m^+$ and $m^-$ correspond to the positive and negative margins used to increase the intra-class similarity and reduce the inter-class similarity, respectively, and $\lambda$ is a constant for controlling the weight of each term. The above loss (Equation 1) promotes a larger capsule length ($\|v_k\|$) for the correct class, and a smaller length for the capsules corresponding to the other classes. Capsule networks are relatively less explored in the literature, with limited or no modification to the architecture or loss function. They have been used for brain tumor detection [1], seagrass detection [9], generating synthetic data [10], and image classification [32]. Capsule networks encode the instantiation parameters of a given input, and thus present the potential of being an appropriate network for VLR image recognition.
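For concreteness, below is a minimal PyTorch sketch of Equation 1 (PyTorch is the framework used in this work, per Section 3). The squared hinge terms and the defaults m+ = 0.9, m- = 0.1, and λ = 0.5 follow Sabour et al. [22]; the function name and tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Sketch of the margin loss (Equation 1).

    v:       (B, K, d) class-capsule outputs; ||v_k|| acts as the class score.
    targets: (B,) integer class labels.
    """
    lengths = v.norm(dim=-1)                                    # (B, K)
    T = torch.zeros_like(lengths).scatter_(1, targets.unsqueeze(1), 1.0)
    present = T * torch.clamp(m_pos - lengths, min=0.0) ** 2    # pull correct class up
    absent = lam * (1.0 - T) * torch.clamp(lengths - m_neg, min=0.0) ** 2
    return (present + absent).sum(dim=1).mean()
```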
As shown in Figure 3, the proposed DirectCapsNet can be broken down into three components: (i) input, (ii) feature extraction, and (iii) classification. At the time of training, the input consists of both HR and VLR samples. The feature extraction module consists of convolutional layers and the proposed HR-anchor loss, and the classification module consists of a capsule network coupled with the proposed targeted reconstruction loss.

Figure 3: Architecture of the proposed Dual Directed Capsule Network (DirectCapsNet) for the SVHN dataset [17]: HR and VLR images pass through a 256-filter 9x9 convolution (stride 1) with ReLU and batch normalization, a primary capsule layer (32 maps of 8D capsules, 9x9 kernels, stride 2), and 10 classification capsules of dimension 16, followed by the reconstruction module with target HR reconstructions. A diagrammatic representation of the HR-anchor loss is presented for a given class: (i) features, (ii) HR-anchor computation, (iii) learned features. HR images are used to complement the features learned by the VLR recognition model by directing the model to learn discriminative and information-rich features.

By enforcing dual direction via the proposed (i) HR-anchor loss and (ii) targeted reconstruction loss, DirectCapsNet focuses on learning meaningful, feature-rich representations for VLR inputs, aided by the auxiliary HR samples. The loss function of the proposed DirectCapsNet is formulated as:

$$L_{DirectCapsNet} = L_{Margin} + \lambda_1 L_{HR\text{-}anchor} + \lambda_2 L_{T\text{-}Recon} \quad (2)$$

where $\lambda_1$ and $\lambda_2$ balance the weights of the HR-anchor and targeted reconstruction losses with respect to the margin loss. The margin loss introduces discriminability between classes, while the HR-anchor loss and targeted reconstruction loss enforce information-rich representations at the feature and classification levels. At the time of testing, for a given VLR input, the class capsule with the highest length is chosen as the class of the given input. It is essential to note that, simulating real-world scenarios, DirectCapsNet utilizes the HR samples only at the time of training, and operates on a given VLR image alone during testing. As demonstrated in the remainder of this section, each component of the proposed model facilitates learning discriminative features for VLR recognition.
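As a small illustration of this test-time decision rule, a sketch with assumed tensor shapes:

```python
import torch

v = torch.randn(4, 10, 16)     # dummy classification capsules: (batch, classes, dim)
lengths = v.norm(dim=-1)       # one activity-vector length per class
pred = lengths.argmax(dim=1)   # predicted class = capsule with the largest length
```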
Proposed HR-anchor Loss:

Input samples in Figure 3 are HR (32x32x3) and VLR (8x8x3, upscaled to HR resolution) images from the SVHN dataset [17]. The limited information content in VLR images makes it difficult to extract discriminative information, often resulting in ineffective recognition, a phenomenon observed in humans as well [27]. The proposed HR-anchor loss addresses this challenge by pushing VLR features closer to their HR counterparts. This ensures learning of a discriminative space for VLR recognition, even with limited information. For an input $x_c$ belonging to class $c$, with features $f_{x_c}$ learned from the convolutional layers, the HR-anchor loss is formulated as:

$$L_{HR\text{-}anchor} = \frac{1}{2}\big((1 - r_{x_c})\|f_{x_c} - \hat{A}_c\|_2^2 + r_{x_c}\|f_{x_c} - A_c\|_2^2\big) \quad (3)$$

where $r_{x_c}$ is a binary variable denoting the resolution of the sample, i.e., $r_{x_c} = 1$ for an HR sample and $r_{x_c} = 0$ for a VLR sample. Since HR samples are only used during training, this information is readily available. $f_{x_c}$ refers to the features extracted from the convolutional layers in the feature module; $\hat{A}_c$ and $A_c$ both refer to the HR-anchor of class $c$, which is used to enhance the VLR representations. Specifically, $\hat{A}_c$ refers to the HR-anchor in a constant state, whereas $A_c$ represents the HR-anchor in parameter form, which is optimized. The HR-anchor of a particular class corresponds to the average feature vector of all HR samples belonging to that class. Given a VLR sample ($r_{x_c} = 0$), the first part of Equation 3 ($\|f_{x_c} - \hat{A}_c\|_2^2$) is active, where the HR-anchor of class $c$ pulls the VLR feature $f_{x_c}$ closer to the anchor, thereby facilitating the learning of discriminative features useful for classification. For an HR sample ($r_{x_c} = 1$), the second half of Equation 3 ($\|f_{x_c} - A_c\|_2^2$) becomes active, where both the HR-anchor and the features are updated.

The proposed HR-anchor loss thus combines learning the HR-anchors with learning VLR features closer to the HR feature space, in order to obtain discriminative VLR features. The first term attempts to direct the VLR features towards the HR anchors, and the second term learns representative HR anchors from the HR features. It is important to note that there is no contribution of the VLR features to the anchor generation, since the HR anchors are constant in the first term. This ensures that the VLR features are directed towards the higher quality HR features, and not the other way round. Therefore, Equation 3 promotes the learning of informative VLR features with assistance from the HR samples.
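One plausible PyTorch realisation of Equation 3 is sketched below; the class name and the use of learnable anchor parameters are assumptions. Detaching the anchors for VLR samples implements the "constant state" $\hat{A}_c$, so VLR features are pulled toward the anchors without the anchors being pulled toward the VLR features, while the gradient of the HR term drives each anchor toward the mean HR feature of its class, matching the description of the anchor as the average HR feature vector.

```python
import torch
import torch.nn as nn

class HRAnchorLoss(nn.Module):
    """Sketch of the HR-anchor loss (Equation 3)."""
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        # One learnable anchor per class, estimated from HR features only.
        self.anchors = nn.Parameter(torch.zeros(n_classes, feat_dim))

    def forward(self, f, labels, is_hr):
        # f: (B, feat_dim) flattened features from the final conv layer
        # labels: (B,) class indices; is_hr: (B,) 1.0 for HR, 0.0 for VLR
        A = self.anchors[labels]                      # (B, feat_dim)
        d_vlr = ((f - A.detach()) ** 2).sum(dim=1)    # anchor held constant (A_c hat)
        d_hr = ((f - A) ** 2).sum(dim=1)              # anchor updated by HR samples (A_c)
        return 0.5 * ((1.0 - is_hr) * d_vlr + is_hr * d_hr).mean()
```

The detach-based asymmetry is the same device used in center-loss style objectives: the anchors act as class centers, but here only HR samples are allowed to move them.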
Proposed Targeted Reconstruction Loss:

The second form of direction is imposed via the targeted reconstruction loss (Figure 3) at the classification module (capsule network). The targeted reconstruction loss promotes learning similar classification capsules for HR and VLR samples. As explained previously, a capsule is a vector which encodes the instantiation parameters of the input sample [22]. For a given input, the activations of a capsule are termed the activity vector. For reconstruction, only the activity vector of the target class is selected and used to reconstruct the input sample. For an input image $x_c$ belonging to class $c$, the standard reconstruction loss is mathematically formulated as:

$$L_{Recon} = \frac{1}{2}\|x_c - g(v_c^{x_c})\|_2^2 \quad (4)$$

where $v_c^{x_c}$ is the activity vector of the classification capsule of the $c$-th class for the input $x_c$, and $g(\cdot)$ refers to the reconstruction network. The reconstruction loss attempts to encode instantiation parameters that are able to explain, and thus reconstruct, the input image. Intuitively, we believe that the instantiation parameters of an HR sample and its corresponding VLR sample should be similar. Therefore, in order to incorporate a second level of direction, the targeted reconstruction loss is introduced in the proposed DirectCapsNet.

The targeted reconstruction loss enforces the HR counterpart of a VLR image at the output of the reconstruction network. Regardless of an HR or a VLR input, the reconstructed sample is forced to be an HR image. For an input $x_c$, the targeted reconstruction loss can be written as:

$$L_{T\text{-}Recon} = \|hr_{x_c} - g(v_c^{x_c})\|_2^2 \quad (5)$$

where $hr_{x_c}$ is the HR image corresponding to the input HR/VLR sample and $v_c^{x_c}$ is the activity vector of the $c$-th class. In the case of an HR input image, Equation 5 ensures that the HR input is reconstructed at the output of the reconstruction network. For a VLR image, its HR counterpart is provided as the target of the reconstruction network. Since the reconstruction network operates on the final classification capsule, the targeted reconstruction loss pushes the HR and VLR samples to have similar capsule activity vectors, driven by the HR samples. Therefore, the targeted reconstruction loss promotes learning similar capsule features for HR and VLR samples directly at the classification stage, by directing the model to reconstruct an HR sample from an extracted VLR feature.

Equations 3 and 5 are combined with Equation 1, and the loss function of the proposed DirectCapsNet for an input $x_c$ (belonging to class $c$) is written as:

$$L_{DirectCapsNet} = \sum_{k=1}^{K}\Big(T_k \max(0, m^+ - \|v_k^{x_c}\|)^2 + \lambda(1 - T_k)\max(0, \|v_k^{x_c}\| - m^-)^2\Big) + \frac{1}{2}\Big(\lambda_1(1 - r_{x_c})\|f_{x_c} - \hat{A}_c\|_2^2 + \lambda_1 r_{x_c}\|f_{x_c} - A_c\|_2^2 + \lambda_2\|hr_{x_c} - g(v_c^{x_c})\|_2^2\Big) \quad (6)$$
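A corresponding sketch of Equation 5 follows; the decoder $g(\cdot)$ is assumed to be a three fully-connected-layer reconstruction network (layer widths are assumptions in the spirit of Sabour et al. [22]), and the HR image is always the regression target.

```python
import torch
import torch.nn as nn

class TargetedReconLoss(nn.Module):
    """Sketch of the targeted reconstruction loss (Equation 5)."""
    def __init__(self, d_class=16, out_pixels=3 * 32 * 32):
        super().__init__()
        self.g = nn.Sequential(                       # reconstruction network g(.)
            nn.Linear(d_class, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, out_pixels), nn.Sigmoid(),
        )

    def forward(self, v, labels, hr_images):
        # v: (B, K, d_class) classification capsules;
        # hr_images: (B, C, H, W) HR targets, used for both HR and VLR inputs.
        v_c = v[torch.arange(v.size(0)), labels]      # activity vector of the true class
        recon = self.g(v_c)                           # g(v_c^{x_c})
        return ((recon - hr_images.flatten(1)) ** 2).sum(dim=1).mean()
```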
DirectCapsNet has been implemented in Python, using the PyTorch framework on an NVIDIA Tesla P-100 GPU. The Adam optimizer [12] has been used for learning the model. The weights of the HR-anchor loss ($\lambda_1$ of Equation 6) and the targeted reconstruction loss ($\lambda_2$ of Equation 6) are set to small fractional values so that the margin loss remains the dominant term. The positive and negative margins for the margin loss ($m^+$ and $m^-$ of Equation 1) are set to 0.9 and 0.1, respectively. As shown in Figure 3, for all the experiments, the DirectCapsNet model contains $n$ convolutional layers, followed by two capsule layers. The HR-anchor loss is applied on the final convolutional layer of the DirectCapsNet. The final capsule layer is connected to a reconstruction network of three fully connected layers. For cases where the HR samples are larger (e.g., the 80x80 face images), three convolutional layers (the first with 16 filters) are used with a batch size of 32 samples; where the HR samples are smaller, a single convolutional layer with 128 filters is used with a batch size of 100 samples. The ReLU activation function is used between the convolutional layers, along with batch normalization [8]. All models have been trained from scratch and no pre-trained networks have been used.
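For completeness, a simplified, self-contained sketch of the capsule classification head (primary capsules followed by dynamic routing [22]) assumed by the description above; the dimensions match the SVHN configuration in Figure 3, but this is an illustrative reimplementation, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Squashing non-linearity [22]: preserves direction, maps length into [0, 1).
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

class PrimaryCaps(nn.Module):
    """32 maps of 8D capsules via a 9x9 stride-2 convolution (Figure 3)."""
    def __init__(self, in_ch=256, maps=32, d=8):
        super().__init__()
        self.d = d
        self.conv = nn.Conv2d(in_ch, maps * d, kernel_size=9, stride=2)

    def forward(self, x):                              # x: (B, 256, H, W)
        u = self.conv(x)                               # (B, maps*d, H', W')
        u = u.view(x.size(0), -1, self.d)              # (B, n_primary, d)
        return squash(u)

class ClassCaps(nn.Module):
    """10 classification capsules of dimension 16 with dynamic routing."""
    def __init__(self, n_primary, d_in=8, n_classes=10, d_out=16, iters=3):
        super().__init__()
        self.iters = iters
        # One transformation matrix per (primary capsule, class capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(1, n_primary, n_classes, d_out, d_in))

    def forward(self, u):                              # u: (B, n_primary, d_in)
        u_hat = (self.W @ u[:, :, None, :, None]).squeeze(-1)  # (B, n_p, K, d_out)
        b = torch.zeros(u_hat.shape[:3], device=u.device)      # routing logits
        for _ in range(self.iters):
            c = F.softmax(b, dim=2)                    # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))   # (B, K, d_out)
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)       # agreement update
        return v                                       # ||v_k|| is the class score
```

For 32x32 SVHN inputs, the 9x9 stride-1 convolution of Figure 3 yields 24x24 maps and the primary-capsule convolution yields 8x8 maps, i.e., n_primary = 32 * 8 * 8 = 2048; the margin loss of Equation 1 is then applied on the returned capsules.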
4. Experiments and Protocols
The proposed DirectCapsNet has been evaluated for three very low resolution (VLR) recognition problems: (i) VLR digit recognition, (ii) VLR face recognition, and (iii) unconstrained VLR face recognition. Details regarding the datasets and protocols for each case study are as follows:
Case study 1 - VLR Digit Recognition:
The Street View House Numbers (SVHN) dataset [17] has been used for VLR digit recognition. The dataset contains real-world images of digits in the range [0, 9]. The pre-defined benchmark protocol has been used for the given 10-class problem, wherein 73,257 digits are used for training and 26,032 digits are used for testing. For VLR recognition, consistent with the existing protocol [30], 32x32 HR images and 8x8 VLR images are used. Results are reported in terms of the top-1 and top-5 accuracies.
Case study 2 - VLR Face Recognition:
VLR face recognition has direct applicability in scenarios of image tagging or situations where multiple people are captured in a single image. For this case study, experiments have been performed on the CMU Multi-PIE dataset [6], which simulates a constrained setting. Consistent with the existing protocol [25], 237 subjects are used. One HR image per subject is added to the training set/gallery, and one VLR image per subject is added to the testing set/probe. The VLR probe images are of 8x8 and 16x16 resolution, respectively. Results are reported using the rank-1 identification accuracy.

Case study 3 - Unconstrained VLR Face Recognition:
Unconstrained VLR face recognition has wide applicability in surveillance scenarios, where the VLR face image often contains other variations such as pose, illumination, and occlusion. Experiments have been performed on two datasets: (a) the UnConstrained College Students (UCCS) dataset [24] for an unconstrained surveillance setting, and (b) the CMU Multi-PIE dataset [6], with pose and illumination variations, for a semi-constrained setting.

The UCCS dataset contains images of college students, captured using a long-range high-resolution surveillance camera kept at a standoff distance of 100 to 150 meters. The images show students walking around the campus between classes. The large standoff distance and unconstrained nature of the data simulate real-world surveillance settings. The dataset contains a labeled subset of 1,732 identities. Consistent with the existing protocol [4, 30], a subset containing the top 180 identities (in terms of the number of images) is used for evaluation. As per the protocol, each subject's images are divided into fixed training and testing partitions. The VLR images are of 16x16 resolution, whereas the HR images are of 80x80 pixels.

As described above, the CMU Multi-PIE dataset [6] contains images with pose, expression, and illumination variations. As per the existing protocol [16], in this case study face recognition is performed across pose and illumination variations for VLR images. Images pertaining to 50 subjects are used for training, and images of the remaining subjects form the test set. In our experiments, we do not utilize the training set and only use the gallery images of the test set to train the proposed DirectCapsNet model. The gallery comprises the frontal images (used for training the proposed model), and the probe (test set) comprises images having a different pose ('05 0' of the dataset) and illumination. Experiments are performed across five different pairs of illumination conditions, and the average rank-1 identification accuracy is reported. Consistent with [16], the probe images are evaluated at four very low resolutions (Figure 6).
Table 1: Top-1 and top-5 accuracy (%) on the SVHN dataset [17] for VLR digit recognition (8x8).

Algorithm                              Top-1   Top-5
CNN (VLR) (2016) [30]                  45.29   66.78
RPC Nets (2016) [30]                   56.98   70.82
CapsNet (HR)                           77.82   87.86
CapsNet (VLR)                          79.19   88.89
DirectCapsNet - (HR-anchor Loss)       82.42   90.15
DirectCapsNet - (Targeted Recon.)      81.95   90.35
Proposed DirectCapsNet                 84.51   91.20

Figure 2 presents sample HR and VLR images from the datasets used in the three case studies. Bicubic interpolation is used for conversion from HR to VLR and vice versa. At the time of training, the HR and VLR pairs are used for the targeted reconstruction loss. Data augmentation is applied by introducing brightness variations, flipping along the y-axis, and random crops. At the time of testing, only the VLR image is provided for classification.
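The HR/VLR pair construction described above can be sketched as follows; the brightness strength, crop padding, and function names are assumptions for illustration.

```python
import torch
from torchvision import transforms
from torchvision.transforms import InterpolationMode
from torchvision.transforms.functional import resize

def make_hr_vlr_pair(hr_image, vlr_size=8):
    """Simulate a VLR sample from an HR one with bicubic interpolation,
    then upscale it back to HR resolution (as done for the 8x8 SVHN
    inputs in Figure 3). hr_image: PIL image or (C, H, W) tensor."""
    hr_h, hr_w = hr_image.shape[-2:] if torch.is_tensor(hr_image) else hr_image.size[::-1]
    vlr = resize(hr_image, [vlr_size, vlr_size], interpolation=InterpolationMode.BICUBIC)
    vlr_up = resize(vlr, [hr_h, hr_w], interpolation=InterpolationMode.BICUBIC)
    return hr_image, vlr_up

# Augmentations described in the text: brightness variations, flipping
# along the y-axis, and random crops (parameter values are assumptions).
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=2),
])
```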
5. Results and Analysis
Tables 1 to 3 and Figures 4 to 6 present the results for the three case studies: (i) VLR digit recognition, (ii) VLR face recognition, and (iii) unconstrained VLR face recognition. Analysis of the proposed DirectCapsNet has also been performed in order to demonstrate the effectiveness of each component. Since existing protocols have been used for evaluation, results are reported directly from the respective publications.
Ablation Study and Analysis of DirectCapsNet:
Experiments have been performed on the SVHN dataset to analyze each component of the proposed DirectCapsNet and motivate its inclusion in the final model. As observed from Table 1, the native CapsNet model (having the margin loss), when trained on VLR images (CapsNet (VLR)), attains a top-1 classification accuracy of 79.19%, which demonstrates a large improvement over the native CNN architecture (45.29%) [30]. The improved performance promotes the usage of capsule networks for the task of VLR recognition. Consistent with the literature [22], it is our belief that since CapsNet attempts to encode the instantiation parameters of the data, it learns features invariant to minor variations, a desirable property of a robust VLR recognition module. Table 1 also shows that the two ablated variants of DirectCapsNet (82.42% and 81.95% top-1) fall between the base CapsNet (79.19%) and the complete model (84.51%), indicating that both proposed losses contribute to the final performance.

Figure 4: Sample reconstructions obtained on the SVHN dataset from VLR input ((a) VLR input, (b) reconstructions). DirectCapsNet is able to reconstruct digits where limited information content is available (e.g., green boxes); however, it also fails to correctly reconstruct some challenging cases (e.g., red boxes).

Further, in order to reaffirm the necessity of a VLR recognition model, a CapsNet with the same architecture is trained on HR images only. In this case, the model does not see any VLR images at the time of training and is evaluated on VLR test images. As can be observed, CapsNet (HR) achieves a classification accuracy of 77.82%, thus reaffirming the need to develop dedicated VLR recognition networks or utilize task-specific information while training. We also performed the McNemar test [15] and obtained a statistically significant difference at a confidence interval (C.I.) of 99% (p-value < 0.01).
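The McNemar test compares paired predictions from two classifiers on the same test set; a minimal sketch using statsmodels (an assumed dependency, not mentioned in the paper) with synthetic predictions:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, 1000)            # dummy labels for illustration
pred_a = y_true.copy(); pred_a[:210] = -1     # model A: ~79% correct
pred_b = y_true.copy(); pred_b[:150] = -1     # model B: ~85% correct

a_ok, b_ok = pred_a == y_true, pred_b == y_true
# 2x2 contingency table over (A correct/incorrect) x (B correct/incorrect).
table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)        # significant if pvalue < 0.01
```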
Case study 1 - VLR Digit Classification:

Table 1 presents the top-1 and top-5 classification accuracies on the SVHN dataset for the proposed DirectCapsNet, along with comparisons with other techniques. The proposed DirectCapsNet model achieves a top-1 accuracy of 84.51% and a top-5 accuracy of 91.20%. DirectCapsNet demonstrates an improvement of over 27% at top-1 with respect to the state-of-the-art results of Robust Partially Coupled Networks (RPC Nets) [30], a CNN-based framework that learns partially shared weights for VLR and HR samples along with partially independent weights for the two. The superior performance of the proposed DirectCapsNet model motivates its usage for VLR recognition. Figure 4 presents sample reconstructions obtained from the DirectCapsNet for 8x8 VLR samples. It is encouraging to note that the DirectCapsNet model is able to reconstruct the digits of the input samples, which motivates the inclusion of the targeted reconstruction loss. Similar reconstructions are obtained for samples of the same class, which demonstrates the effectiveness of the HR-anchor loss in increasing the intra-class similarity between features.
Case study 2 - VLR Face Recognition:
Table 2 presents the rank-1 identification (or top-1 recognition) accuracy for the two protocols of VLR face recognition. The proposed DirectCapsNet model achieves accuracies of 94.5% and 97.4% for 8x8 and 16x16 VLR images, respectively, with higher-resolution auxiliary HR images (Table 2), on the constrained CMU Multi-PIE dataset. DirectCapsNet demonstrates an improvement of almost 12% as compared to the state-of-the-art Synthesis via Hierarchical Sparse Representations (SHSR) [25] for 8x8 resolution images. Figure 5 presents sample VLR and HR face images, along with the reconstructions obtained from the DirectCapsNet. The proposed model is able to reconstruct faces belonging to the same subject onto a similar target, suggesting high within-class similarity. Both VLR and HR samples are reconstructed as similar images, which reaffirms the benefit of the targeted reconstruction and HR-anchor losses.

Figure 5: Sample reconstructions obtained from the proposed DirectCapsNet model on the CMU Multi-PIE dataset: (a) VLR input, (b) reconstruction for VLR input, (c) HR input, (d) reconstruction for HR input. For the same class, DirectCapsNet is able to project VLR and HR samples onto a similar target, suggesting robust resolution-invariant feature representations.

Table 2: Rank-1 accuracy (%) for VLR face recognition on the CMU Multi-PIE dataset [6].

Algorithm                               8x8    16x16
Original + COTS (2018) [25]             0.0    0.0
Bicubic Interp. + COTS (2018) [25]      0.1    1.1
SHSR (Synthesis + COTS) (2018) [25]     82.6   91.1
Proposed DirectCapsNet                  94.5   97.4
Table 3: Rank-1 accuracy (%) on the UCCS dataset [24] for VLR face recognition (16x16). The HR images are of 80x80 resolution.

Algorithm                                        Acc. (%)
Robust Partially Coupled Nets (2016) [30]        59.03
Selective Knowledge Distillation (2019) [4]      67.25
LMSoftmax for VLR (2019) [14]                    64.90
L2Softmax for VLR (2019) [14]                    85.00
Centerloss for VLR (2019) [14]                   93.40
Proposed DirectCapsNet                           95.81
Figure 6: Performance of the proposed DirectCapsNet forvarying resolutions of VLR face recognition with pose andillumination variations. The HR resolution was fixed to × pixels. Comparison has been shown with HR-LR (MDS)[3] and Mudunuri and Biswas [16]. Test Image Number S c o r e Genuine ClassImposter Class
Figure 7: Scores obtained by the proposed DirectCapsNetfor VLR recognition on some samples of the UCCS dataset.Each test image has one genuine score (correct class) and179 imposter scores (incorrect class).
Case study 3 - Unconstrained VLR Face Recognition:
Table 3 and Figure 6 present the rank-1 identification (or top-1 recognition) accuracy for unconstrained VLR face recognition. As shown in Table 3, on the UCCS dataset, the DirectCapsNet model achieves a rank-1 accuracy of 95.81%, demonstrating an improvement of almost 2.5% over the state-of-the-art and almost 10% over the current second best [14]. Comparison has also been performed with the recently proposed large-margin softmax (LMSoftmax), l2-constrained softmax (L2Softmax), and center-loss based VLR recognition systems [14]. The improved performance of the proposed DirectCapsNet over these metric learning techniques demonstrates the benefit of incorporating auxiliary HR information via the proposed dual directed loss functions during training. Figure 7 presents the scores obtained on samples of the UCCS dataset by the DirectCapsNet model; the scores correspond to the lengths of the activity vectors of the capsules used for classification. Figure 7 suggests that the model generates a high score for the correct class and small scores for the other classes, which promotes separability, resulting in high recognition performance.

Similar performance is obtained on the CMU Multi-PIE dataset (Figure 6) with pose and illumination variations, where the proposed DirectCapsNet achieves an average recognition accuracy of 95.17%, an improvement of around 1.64% over the current state-of-the-art algorithm [16]. Figure 6 demonstrates that, unlike other techniques, the proposed DirectCapsNet does not suffer a major decrease in accuracy as the resolution reduces. The model achieves recognition accuracies of 92.15% and 90.34% for the two lowest resolutions, respectively, whereas the second best performing model [16] shows a drop of almost 9% between the two. The improved recognition performance across multiple very low resolutions supports the applicability of the proposed DirectCapsNet model in real-world scenarios.
6. Conclusion
Existing research has primarily focused on high resolution and low resolution image recognition; however, the problem of VLR recognition has received limited attention. VLR recognition, an arduous problem with wide applicability in real-world scenarios, suffers from the primary challenge of low information content. This research presents a novel Dual Directed Capsule Network (DirectCapsNet) for VLR recognition. DirectCapsNet combines the margin loss for classification with the proposed HR-anchor loss and targeted reconstruction loss for enhancing the VLR features. HR images are used during training as 'auxiliary' data to complement the VLR feature learning. Experimental results on VLR digit recognition (SVHN database) and constrained/unconstrained VLR face recognition (CMU Multi-PIE and UCCS databases) demonstrate the efficacy of the proposed model, and promote its usability for different VLR tasks. In the future, we plan to extend the proposed algorithm to address multiple covariates; for example, in face recognition applications, VLR recognition in the presence of disguise [26], aging [21], spectral variations [19], and adversarial attacks [5].
7. Acknowledgement
This research is partially supported through the Infosys Center for Artificial Intelligence, IIIT-Delhi, India. M. Vatsa is also supported through the Swarnajayanti Fellowship by the Government of India. S. Nagpal is supported via the TCS PhD fellowship.

References
[1] Parnian Afshar, Arash Mohammadi, and Konstantinos N. Plataniotis. Brain tumor type classification via capsule networks. In IEEE International Conference on Image Processing, pages 3129–3133, 2018.
[2] Himanshu S. Bhatt, Richa Singh, Mayank Vatsa, and Nalini K. Ratha. Improving cross-resolution face matching using ensemble-based co-transfer learning. IEEE Transactions on Image Processing, 23(12):5654–5669, 2014.
[3] Soma Biswas, Gaurav Aggarwal, Patrick J. Flynn, and Kevin W. Bowyer. Pose-robust recognition of low-resolution face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3037–3049, 2013.
[4] Shiming Ge, Shengwei Zhao, Chenyu Li, and Jia Li. Low-resolution face recognition in the wild via selective knowledge distillation. IEEE Transactions on Image Processing, 28(4):2051–2062, 2019.
[5] Gaurav Goswami, Akshay Agarwal, Nalini Ratha, Richa Singh, and Mayank Vatsa. Detecting and mitigating adversarial perturbations for robust face recognition. International Journal of Computer Vision, 127(6):719–742, 2019.
[6] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[7] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51, 2011.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[9] Kazi Aminul Islam, Daniel Pérez, Victoria Hill, Blake Schaeffer, Richard Zimmerman, and Jiang Li. Seagrass detection in coastal water through deep capsule networks. In Chinese Conference on Pattern Recognition and Computer Vision, pages 320–331, 2018.
[10] Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, and Premkumar Natarajan. CapsuleGAN: Generative adversarial capsule network. In European Conference on Computer Vision, pages 526–535, 2018.
[11] Muwei Jian and Kin-Man Lam. Simultaneous hallucination and recognition of low-resolution faces based on singular value decomposition. IEEE Transactions on Circuits and Systems for Video Technology, 25(11):1761–1772, 2015.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 105–114, 2017.
[14] Pei Li, Loreto Prieto, Domingo Mery, and Patrick J. Flynn. On low-resolution face recognition in the wild: Comparisons and new techniques. IEEE Transactions on Information Forensics and Security, 14(8):2000–2012, 2019.
[15] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
[16] Sivaram Prasad Mudunuri and Soma Biswas. Low resolution face recognition across variations in pose and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):1034–1040, 2016.
[17] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[18] Shuxin Ouyang, Timothy Hospedales, Yi-Zhe Song, Xueming Li, Chen Change Loy, and Xiaogang Wang. A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution. Image and Vision Computing, 56:28–48, 2016.
[19] Shuxin Ouyang, Timothy Hospedales, Yi-Zhe Song, Xueming Li, Chen Change Loy, and Xiaogang Wang. A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution. Image and Vision Computing, 56:28–48, 2016.
[20] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3):21–36, 2003.
[21] Narayanan Ramanathan and Rama Chellappa. Face verification across age progression. IEEE Transactions on Image Processing, 15(11):3349–3361, 2006.
[22] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
[23] Mehdi S. M. Sajjadi, Bernhard Scholkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In IEEE International Conference on Computer Vision, pages 4501–4510, 2017.
[24] Archana Sapkota and Terrance E. Boult. Large scale unconstrained open set face database. In IEEE International Conference on Biometrics: Theory, Applications and Systems, pages 1–8, 2013.
[25] Maneet Singh, Shruti Nagpal, Mayank Vatsa, Richa Singh, and Angshul Majumdar. Identity aware synthesis for cross resolution face recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 592–59209, 2018.
[26] Maneet Singh, Richa Singh, Mayank Vatsa, Nalini K. Ratha, and Rama Chellappa. Recognizing disguised faces in the wild. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(2):97–108, 2019.
[27] Pawan Sinha, Benjamin Balas, Yuri Ostrovsky, and Richard Russell. Face recognition by humans: Nineteen results all computer vision researchers should know about. Proceedings of the IEEE, 94(11):1948–1962, 2006.
[28] Nannan Wang, Dacheng Tao, Xinbo Gao, Xuelong Li, and Jie Li. A comprehensive survey to face hallucination. International Journal of Computer Vision, 106(1):9–30, 2014.
[29] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In IEEE Conference on Computer Vision and Pattern Recognition, pages 606–615, 2018.
[30] Zhangyang Wang, Shiyu Chang, Yingzhen Yang, Ding Liu, and Thomas S. Huang. Studying very low resolution recognition using deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4792–4800, 2016.
[31] Zhifei Wang, Zhenjiang Miao, Q. M. Jonathan Wu, Yanli Wan, and Zhen Tang. Low-resolution face recognition: a review. The Visual Computer, 30(4):359–386, 2014.
[32] Canqun Xiang, Lu Zhang, Yi Tang, Wenbin Zou, and Chen Xu. MS-CapsNet: A novel multi-scale capsule network. IEEE Signal Processing Letters, 25(12):1850–1854, 2018.
[33] Wilman W. W. Zou and Pong C. Yuen. Very low resolution face recognition problem. IEEE Transactions on Image Processing, 21(1):327–340, 2012.