Unsupervised Pre-trained, Texture Aware And Lightweight Model for Deep Learning-Based Iris Recognition Under Limited Annotated Data
Manashi Chakraborty, Mayukh Roy, Prabir Kumar Biswas, Pabitra Mitra
Manashi Chakraborty †‡, Mayukh Roy ‡, Prabir Kumar Biswas, Pabitra Mitra
Indian Institute of Technology, Kharagpur, India
ABSTRACT
In this paper, we present a texture-aware, lightweight deep learning framework for iris recognition. Our contributions are primarily threefold. First, to address the dearth of labelled iris data, we propose a reconstruction-loss-guided unsupervised pre-training stage followed by supervised refinement. This drives the network weights to focus on discriminative iris texture patterns. Next, we propose several texture-aware improvements inside a Convolutional Neural Network to better leverage iris textures. Finally, we show that our systematic training and architectural choices enable us to design an efficient framework with up to 100× fewer parameters than contemporary deep learning baselines while achieving better recognition performance in within-dataset and cross-dataset evaluations.
Index Terms: Iris Recognition, Deep Learning, CNN, Texture, Lightweight
1. INTRODUCTION
Over the last few years, iris biometrics has shown immense potential as a highly reliable biometric recognition system [1, 2, 3, 4, 5]. Iris textures are highly subject-discriminative [2], and because the iris is an internal organ of the eye, it is resilient to environmental perturbations and immutable over time.
The initial works on iris recognition focused on designing traditional hand-engineered features [1, 6, 7, 8]. Recent success over a variety of vision applications on natural images [9, 10] showcases the advantage of deep Convolutional Neural Networks (CNNs) over hand-crafted features. Inspired by the success of CNNs, the iris biometrics community has also started exploring deep learning, and an appreciable gain in performance [11, 12, 13] is observed compared to traditional methods. However, some intrinsic issues, such as the absence of large annotated datasets, explicit processing of texture information and lightweight architecture design, have hardly been addressed. In this paper, we address these concerns with several systematic modifications to conventional CNN training pipelines and architectural choices.
Handling Absence of Large Dataset:
CNNs are data-hungry and usually require millions of annotated samples for fruitful training. This is not an issue for natural images, where datasets such as ImageNet [14] and MS-COCO [15] contain large volumes of annotated data. For iris biometrics, however, dataset sizes are usually limited to a few thousand samples. This shortcoming remains an open challenge for deep learning based iris biometrics research. In this paper, we address the problem with a two-stage training strategy. In the first stage, we pre-train a parameterized feature encoder, $E_\theta(\cdot)$, to capture iris texture signatures in an unsupervised training framework. In the second stage, $E_\theta(\cdot)$ acts as a feature extractor and is further refined along with a classification head, $C_\psi(\cdot)$. We show that the combined training framework provides a significant boost in performance compared to single-stage training. Further, visualization with Layer-wise Relevance Propagation [16] shows that, as opposed to single-stage training, our proposed stage-wise training drives the network weights to focus more on the iris textures. This further motivated us to design the systematic texture-attentive architectural choices described below.
† All correspondence to: [email protected]  ‡ Denotes equal contribution
Energy Aware Pooling:
Non-parametric spatial sub-sampling (usually realised as Max-pooling) is a crucial component of conventional deep networks, widely used to retain the maximum response of a specified window. In this paper, we show that on a texture-rich iris [2] dataset, sub-sampling using Energy Aware Pooling (EAP) is a better alternative to the $\max(\cdot)$ operation.
Texture Energy Layer:
It is common practice in deep networks to have several fully connected layers at the end to amalgamate global structure information. However, iris images are mainly rich in local textures. We therefore propose a Texture Energy Layer (TEL) that specifically captures the energy of the last convolutional filter bank responses. Such energy-based features have traditionally been used for texture classification [17, 18, 19].
Light-weight Model for Inference:
The systematic design strategies enable us to operate with a much shallower architecture yet achieve better performance than the deeper baselines. Additionally, the TEL layer obviates the computationally heavy penultimate fully connected layer of our proposed base architecture. As a consequence, our model has a significantly lower parameter count. This is particularly important since iris biometrics is gradually becoming an integral component of many handheld mobile devices.
Our proposed architectural choices consistently outperform traditional methods as well as recent deep nets by a noteworthy margin. Even when the target dataset differs from the training data, our proposed model generalises with better performance, without any fine-tuning on the target data.
Fig. 1. Stage-wise training framework of the proposed $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$.
2. RELATED WORK
Initial attempts at iris recognition relied primarily on traditional techniques that extract features from various filter bank responses. Daugman [1] extracted representative iris features from the responses of 2-D Gabor filters. Masek et al. extracted responses from 1-D Log-Gabor filters [6]. Ma et al. [8] proposed a bank of circularly symmetric, sinusoidally modulated Gaussian filters to capture the discriminative iris textures. Wildes et al. [5] extracted discriminative iris textures from a multi-scale Laplacian of Gaussian (LoG). Monro et al. used features from the Discrete Cosine Transform (DCT) [7]. To summarize, the earlier works mainly focused on hand-crafted feature representations.
Initial attempts [11, 13] at leveraging deep learning for iris recognition involved feature extraction using well-known neural networks pre-trained for ImageNet classification, followed by a supervised classification stage. Recently, Gangwar et al. [12] proposed DeepIrisNet, an end-to-end trainable (from scratch) deep neural network, and achieved an appreciable boost over the traditional methods.
3. METHODOLOGY
3.1. Network Architecture
Stage-1:
In the first phase, we follow an unsupervised framework for pre-training a feature encoder, $E_\theta(\cdot)$ (the subscript $\theta$ refers to the set of trainable parameters), to capture texture signatures. For this, we train a convolutional auto-encoder with a reconstruction loss, $L_R$. Specifically, given a normalised iris image, $I$ (an example is shown in Figure 1), we project it to a smaller resolution (by strided convolution and spatial sub-sampling) using the encoder and then decode it back to the original resolution with a decoder, $D_\phi(\cdot)$. The configurations of the layers of $E_\theta(\cdot)$ and $D_\phi(\cdot)$ are shown in Table 1. $L_R$ is thus applied between the original image, $I$, and the reconstructed image, $\hat{I} = D_\phi(E_\theta(I))$. In this paper, we use the Structural Similarity (SSIM) metric (upper-bounded by 1) as a proxy for gauging the similarity between the original and reconstructed images. So, we minimise the following:
$$L_R = 1 - \mathrm{SSIM}(I, D_\phi(E_\theta(I))). \qquad (1)$$
Fig. 2. Relevance maps (red is most important, blue is least) of three different irises corresponding to three classes of the CASIA.v4-Distance dataset. Row 1: Normalised iris image. Row 2: Relevance map of $\mathrm{CombNet}_R$ (randomly initialised encoder). Row 3: Relevance map of $\mathrm{CombNet}_{E_\theta}$ (initialised with pre-trained encoder).
Stage-2 (CombNet):
In the second stage, the activations of $E_\theta(\cdot)$ are passed to the classification branch, $C_\psi(\cdot)$. Following the usual trend, the baseline $C_\psi(\cdot)$ consists of two fully connected layers followed by a softmax activation layer to output class probabilities. The combination $(E_\theta(\cdot), C_\psi(\cdot))$ is optimised using cross-entropy loss. We term this combined architecture CombNet. We define $\mathrm{CombNet}_{E_\theta}$ as the combined model whose encoder, $E_\theta(\cdot)$, is pre-trained with the reconstruction loss from Stage-1. $\mathrm{CombNet}_R$ is the CombNet model in which the encoder is randomly initialised (without any pre-training).
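As a rough illustration of the Stage-1 objective in Eq. (1), the sketch below evaluates $L_R = 1 - \mathrm{SSIM}$ for two small images given as flat lists of intensities in [0, 1]. It is only a sketch under stated assumptions: SSIM is computed from whole-image statistics with the commonly used constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$, while the windowing and constants used for training are not specified in the text. The function names `global_ssim` and `reconstruction_loss` are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """SSIM from whole-image statistics (assumption: no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilising constants commonly used with SSIM (assumed, not from the paper).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def reconstruction_loss(image, reconstruction):
    """Eq. (1): L_R = 1 - SSIM(I, D(E(I))), with the reconstruction given directly."""
    return 1.0 - global_ssim(image, reconstruction)


if __name__ == "__main__":
    iris_patch = [0.1, 0.5, 0.9, 0.3]          # toy stand-in for a normalised iris image
    inverted = [1.0 - p for p in iris_patch]   # a deliberately poor reconstruction
    print(reconstruction_loss(iris_patch, iris_patch))
    print(reconstruction_loss(iris_patch, inverted))
```

A perfect reconstruction gives a loss of zero, and the loss grows as the reconstructed texture departs from the input, which is the signal that drives the encoder during pre-training.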
Energy Aware Pooling (EAP):
This layer is proposed to retain the local texture energy during spatial sub-sampling in a CNN. The de facto choice for sub-sampling in CNNs is Max-pool, which is more appropriate for determining the presence or absence of a particular feature over the sampled window. For iris images, which have local textural patterns, it is more prudent to retain the energy of the sub-sampled window. With this in mind, for a pooling kernel of receptive field $k \times k$, EAP calculates the average of the $k^2$ pixels instead of finding the maximum as in the Max-pool operation. Downsampling is achieved by operating this kernel with a stride of 2 pixels. This way of retaining energy while downsampling is closely analogous to the energy of filter bank responses, which has traditionally been used as a discriminative feature for texture classification [17, 19]. We term the model with the proposed EAP layer $\mathrm{CombNet}^{EAP}_{E_\theta}$.
Texture Energy Layer (TEL):
This layer is designed to remove the need for the penultimate fully connected layer of $\mathrm{CombNet}^{EAP}_{E_\theta}$. This computationally heavy fully connected layer has the entire image as its receptive field and thus loses the local textures that are more important for iris recognition. Therefore, in this stage $\mathrm{CombNet}^{EAP}_{E_\theta}$ is made more texture-attentive by adding TEL after the last convolution layer. In this layer we use spatial averaging kernels with spatial support equal to the dimensions of the feature maps from the previous layer. So, if the input to the TEL layer is $H \times W \times C$, its output is $1 \times 1 \times C$. These stacked average values closely correspond to the energy of each activation map of the previous layer. The output of TEL is finally passed to a single fully connected layer followed by a softmax activation to get the final class probabilities. This combined texture-attentive model, having both EAP and TEL layers, is termed $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ and is shown in Figure 1. As TEL removes the need for the penultimate fully connected layer, it dramatically reduces the parameter count (46.72× cheaper) compared to our baseline having two fully connected layers, as reported in Table 2.
Table 1. Configurations of the layers of $E_\theta(\cdot)$ and $D_\phi(\cdot)$. Columns: Type, Kernel, Stride, Padding, Output Channels.
Encoder: Conv layers.
Decoder: four Pixel Shuffle [20] layers with 64, 16, 4 and 1 output channels, respectively.
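The two texture-aware layers described above reduce to plain averaging, which the following sketch makes concrete on a single toy feature map. It is an illustration under assumptions rather than the training code: feature maps are nested Python lists, the window size `k=2` is an assumed default (the text fixes only the stride of 2), no padding is applied, and the function names are hypothetical. Max-pool is included only for contrast.

```python
def _pool(fmap, k, stride, reduce_fn):
    """Slide a k x k window over a 2-D map and reduce each window to one value."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            reduce_fn([fmap[r + i][c + j] for i in range(k) for j in range(k)])
            for c in range(0, width - k + 1, stride)
        ]
        for r in range(0, height - k + 1, stride)
    ]


def max_pool(fmap, k=2, stride=2):
    """Conventional sub-sampling: keep the strongest response per window."""
    return _pool(fmap, k, stride, max)


def energy_aware_pool(fmap, k=2, stride=2):
    """EAP: keep the average of the k*k responses per window."""
    return _pool(fmap, k, stride, lambda window: sum(window) / len(window))


def texture_energy_layer(feature_maps):
    """TEL: collapse each H x W channel to its spatial mean (C maps -> C values)."""
    return [
        sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


if __name__ == "__main__":
    fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(max_pool(fmap))            # [[6, 8], [14, 16]]
    print(energy_aware_pool(fmap))   # [[3.5, 5.5], [11.5, 13.5]]
    print(texture_energy_layer([fmap, [[1, 1], [1, 1]]]))  # [8.5, 1.0]
```

Because TEL emits one value per channel, the fully connected layer that follows sees only $C$ inputs instead of $H \times W \times C$, which is where the parameter saving comes from.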
3.2. Matching Framework
Representative iris signatures (1024-D) are extracted from the TEL layer of $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$. Two iris images are matched according to the dissimilarity score obtained from the normalised Euclidean distance between their respective iris signatures.
4. EXPERIMENTS
4.1. Comparing Methods
We compare our proposed framework with three traditional baselines: Daugman [1], Masek [6] and Ma et al. [8]. From the deep learning paradigm, we compare against a VGG-16 pre-trained on ImageNet and fine-tuned on the iris dataset. This was one of the initial attempts at applying transfer learning with deep neural nets to iris data [11, 13]. We also compare against DeepIrisNet [12], which is a much deeper model with 8 convolution and 3 fully connected layers.
We present our results on CASIA.v4-Distance [21] and CASIA.v4-Thousand [21]. The irises of the left and right eyes have disparate patterns [2] and are thus assigned to different classes, i.e., the number of classes is twice the number of subjects present in the dataset.
The framework of [22] is used for iris segmentation and normalization. Normalised irises of three different subjects of the CASIA.v4-Distance dataset are shown in Figure 2. The spatial resolution of normalised iris images is 512 × 64 for all experiments unless stated otherwise. For fair comparison, the same segmentation and normalization protocol is followed for all experiments. We used the following two dataset configurations for performance evaluation.
Within Dataset:
Here, the 'training+validation' and test splits are selected from the CASIA.v4-Distance dataset [21], which has 142 subjects. Experiments were conducted on 4773 samples from 284 classes (left and right irises are considered different classes). Of these 284 classes, the 'training+validation' split comprises 80% of the classes, and the remaining disjoint 20% form the test split used for reporting verification results (using the matching framework of Section 3.2).
Cross Dataset:
In this setting, all the pre-trained models (trained on CASIA.v4-Distance) were directly used on the CASIA.v4-Thousand dataset without any fine-tuning. This challenging configuration therefore evaluates the generalization capability of the competing deep learning frameworks. CASIA.v4-Thousand has 2000 classes (left and right irises belong to different classes). We perform 5-fold testing. Each fold consists of one fifth of the total classes. The average matching performance over the 5 folds is reported.
Following the matching framework of [12], the test set for both of the above configurations is divided into a gallery (enrolled images) set and a probe (query) set. 50% of the identities in the probe set are impostors (identities not enrolled in the system), while the rest are genuine identities.
Table 2. Self ablation of various architectural choices (columns: Model, Classification Accuracy (in %); rows: $\mathrm{CombNet}_{E_\theta}$, $\mathrm{CombNet}^{EAP}_{E_\theta}$, $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$).
Exp 1- Self ablation of architectural choices:
In this section, we perform a self ablation of our architectural choices. We use classification accuracy on the validation subset of the 'training+validation' split as the metric for model selection. Metrics are reported in Table 2.
a) Benefit of Stage-wise Training:
The classification accuracy of $\mathrm{CombNet}_{E_\theta}$ (60.53%) is higher than that of $\mathrm{CombNet}_R$. This clearly shows the benefit of pre-training the encoder of CombNet over a randomly initialised encoder ($\mathrm{CombNet}_R$). Further, to explain the superiority of $\mathrm{CombNet}_{E_\theta}$ over $\mathrm{CombNet}_R$, we study the relevance map of a given iris image correctly classified by both models. A relevance map indicates which input pixels were important for classification. Figure 2 shows the relevance (heat) maps of both models for three different classes of the CASIA.v4-Distance dataset. It is evident from the figure that pre-training the encoder encourages $\mathrm{CombNet}_{E_\theta}$ to focus more on the texture patterns, as opposed to $\mathrm{CombNet}_R$, which primarily concentrates on the overall shape cues obtained from the boundary pixels (separating the iris region from the background). Prompted by this observation, we incorporate additional improvements into $\mathrm{CombNet}_{E_\theta}$ that further exploit the textural cues for better performance.
b) Benefit of EAP and TEL layers:
From Table 2 we observe that when the Max-Pool layer is replaced by EAP, classification accuracy increases from 60.53% to 74.09%. This bolsters our assumption that the EAP layer is more beneficial for sub-sampling than Max-Pool on texture-rich images. With replacement of the penultimate fully connected layer of $\mathrm{CombNet}^{EAP}_{E_\theta}$ with the TEL layer, we see a further improvement in performance with our $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ model.
Table 3. Comparison on CASIA.v4-Distance (within dataset configuration).
Model                                                  EER (in %)   AUC     Parameters (millions)
Traditional
Masek [6]                                              5.70         0.030   XXX
Li Ma et al. [8]                                       5.45         0.026   XXX
Daugman [1]                                            5.20         0.015   XXX
Deep Nets
VGG-16                                                 4.88         0.012   135.2
DeepIrisNet [12]                                       4.80         0.011   291.2
$\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ (Proposed)     3.25         0.004   2.9
Exp 2- Within and Cross dataset comparison of our preferred architecture with existing methods:
From Exp 1, it is evident that $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ outperforms our other architectural choices. Therefore, in this phase we compare our best architectural choice with existing traditional as well as deep learning models. Performance is evaluated using the EER (Equal Error Rate) and the AUC (Area Under the Curve) of the Detection Error Tradeoff (DET) curve. We also report the parameter counts of the competing deep nets as a measure of computational complexity. Only the test sets (of the within and cross dataset configurations) of the two datasets are used for reporting iris verification performance.
(a.) Within Dataset:
First, we compare the efficacy of our proposed $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ with the three traditional baselines of Daugman [1], Masek [6] and Ma et al. [8]. Across both metrics reported in Table 3, our proposed framework outperforms all three baselines by notable margins. Next, we compare with the recent deep learning frameworks. We first compare against a VGG-16 pre-trained on ImageNet and fine-tuned on the CASIA.v4-Distance dataset, similar to the work done by [11, 13]. A normalised iris image is the input to the VGG-16 framework. Though fine-tuning a pre-trained VGG-16 performs better than the traditional methods, $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ proves superior to it. This can be primarily attributed to the fact that the kernels of VGG-16 were trained to learn the structure and shape cues present in natural images and not the texture-rich content prevalent in iris images. Thus, naively applying transfer learning across such disparate domains is sub-optimal. From Table 3, we also observe that our proposed shallow $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ performs better than DeepIrisNet [12]. This boost is primarily because of our systematic design choices. As argued before, our stage-wise training compels the network to focus more on discriminative iris textures, which is further improved by the incorporation of the EAP and TEL layers. Also, for an iris dataset with a paucity of annotated labels, it is more prudent to use models that are less complex (in parameter count) than their deeper counterparts. Both DeepIrisNet and the fine-tuned VGG-16 have much deeper and more complex architectures for limited annotated iris datasets, and thus our model consistently outperforms them. Figure 3 depicts the DET curves of all the competing models of this phase.
Table 4. Comparison on CASIA.v4-Thousand (cross dataset configuration).
Model                                                  EER (in %)   AUC
DeepIrisNet                                            6.6          0.033
VGG-16                                                 6.6          0.028
$\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ (Proposed)     5.3          0.018
Fig. 3. DET curves. Left: comparing traditional and deep learning methods on CASIA.v4-Distance (Within Dataset). Right: comparing deep learning methods on CASIA.v4-Thousand (Cross Dataset).
(b.) Cross Dataset:
From Table 4, it is evident that even in such a challenging scenario, our proposed framework performs better than the competing deep networks. This demonstrates the better generalization capability of our proposed framework over other deep learning frameworks. Figure 3 depicts the DET curves of the competing deep nets for one randomly selected fold. For fairness, the same fold is chosen for all the compared models.
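For readers less familiar with the metric, the EER reported in Tables 3 and 4 is the operating point at which the false accept rate equals the false reject rate. The sketch below estimates it from two lists of dissimilarity scores. It is an illustrative sketch, not the evaluation code behind the reported numbers: it assumes that lower scores mean more similar pairs, sweeps only the observed scores as thresholds, and uses a hypothetical function name and toy scores.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (EER, threshold) where false accept and false reject rates are closest."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False accept: an impostor pair scores at or below the threshold.
        far = sum(s <= threshold for s in impostor_scores) / len(impostor_scores)
        # False reject: a genuine pair scores above the threshold.
        frr = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    genuine = [0.1, 0.2, 0.3, 0.6]    # toy dissimilarity scores, same-iris pairs
    impostor = [0.4, 0.7, 0.8, 0.9]   # toy dissimilarity scores, different-iris pairs
    eer, threshold = equal_error_rate(genuine, impostor)
    print(eer, threshold)             # 0.25 0.4
```

Plotting the false reject rate against the false accept rate over the same threshold sweep gives the DET curve whose area is reported as AUC.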
Reduction of Parameters:
There is an increasing demand to run biometric systems on mobile devices, so lightweight models are favored for inference. In Table 2, we compare the number of parameters of our different architectural choices. We see that replacing the fully connected layers of $\mathrm{CombNet}_{E_\theta}$ with the TEL layer in $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$ results in a 46.72× reduction in parameters. From Table 3, it can be observed that compared to VGG-16 and DeepIrisNet [12], our model, $\mathrm{CombNet}^{EAP+TEL}_{E_\theta}$, is respectively 46.62× and 100.41× cheaper in terms of parameters, yet performs better than both. Note that the input resolution of the normalised iris images for VGG-16 differs from that of all other models, whose inputs are 512 × 64.
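The reduction factors against VGG-16 and DeepIrisNet follow directly from the parameter counts listed in Table 3, as the short check below shows. The only assumption is that the three counts share the same unit (taken here to be millions, which does not affect the ratios); the dictionary keys and the function name are hypothetical.

```python
# Parameter counts as listed in Table 3 (assumed to be in millions).
PARAMS = {
    "VGG-16": 135.2,
    "DeepIrisNet": 291.2,
    "CombNet EAP+TEL (proposed)": 2.9,
}


def reduction_factor(baseline_params, proposed_params):
    """How many times fewer parameters the proposed model has than a baseline."""
    return baseline_params / proposed_params


if __name__ == "__main__":
    proposed = PARAMS["CombNet EAP+TEL (proposed)"]
    for name in ("VGG-16", "DeepIrisNet"):
        print(name, round(reduction_factor(PARAMS[name], proposed), 2))
    # VGG-16 46.62
    # DeepIrisNet 100.41
```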
5. CONCLUSION
This paper proposes stage-wise, texture-aware training strategies for building a reliable iris verification system under limited annotated data. It showcases the benefits of unsupervised auto-encoder based pre-training as a good weight initializer for training networks with less data. Further, the proposed EAP and TEL layers are shown to leverage the local texture patterns of iris images. Our final framework is significantly lightweight and consistently outperforms competing baselines in within-dataset and cross-dataset evaluations. Motivated by the success of auto-encoder based pre-training, in future we wish to study the benefits of other recent generative models.
REFERENCES
[1] J. Daugman, “How iris recognition works,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 21–30, Jan 2004.
[2] John G Daugman, “High confidence visual recognition of persons by a test of statistical independence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1148–1161, 1993.
[3] D de Martin-Roche, Carmen Sanchez-Avila, and Raul Sanchez-Reillo, “Iris recognition for biometric identification using dyadic wavelet transform zero-crossing,” in Proceedings IEEE 35th Annual 2001 International Carnahan Conference on Security Technology (Cat. No. 01CH37186). IEEE, 2001, pp. 272–277.
[4] Richard P Wildes, Jane C Asmuth, Gilbert L Green, Steven C Hsu, Raymond J Kolczynski, James R Matey, and Sterling E McBride, “A machine-vision system for iris recognition,” Machine Vision and Applications, vol. 9, no. 1, pp. 1–8, 1996.
[5] Richard P Wildes, “Iris recognition: an emerging biometric technology,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363, 1997.
[6] Libor Masek et al., Recognition of human iris patterns for biometric identification, Master's thesis, University of Western Australia, 2003.
[7] Donald M Monro, Soumyadip Rakshit, and Dexin Zhang, “DCT-based iris recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 586–595, 2007.
[8] Li Ma, Tieniu Tan, Yunhong Wang, and Dexin Zhang, “Personal identification based on iris texture analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1519–1533, 2003.
[9] Ross Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] Shervin Minaee, Amirali Abdolrashidiy, and Yao Wang, “An experimental study of deep convolutional features for iris recognition,” IEEE, 2016, pp. 1–6.
[12] Abhishek Gangwar and Akanksha Joshi, “DeepIrisNet: Deep iris representation with applications in iris recognition and cross-sensor iris recognition,” IEEE, 2016, pp. 2301–2305.
[13] Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan, “Iris recognition with off-the-shelf CNN features: A deep learning perspective,” IEEE Access, vol. 6, pp. 18848–18855, 2017.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” IEEE, 2009, pp. 248–255.
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[16] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7, 2015.
[17] Ju Han and Kai-Kuang Ma, “Rotation-invariant and scale-invariant Gabor features for texture image retrieval,” Image and Vision Computing, vol. 25, no. 9, pp. 1474–1481, 2007.
[18] Michael Unser, “Texture classification and segmentation using wavelet frames,” IEEE Transactions on Image Processing, vol. 4, no. 11, pp. 1549–1560, 1995.
[19] Mahamadou Idrissa and Marc Acheroy, “Texture classification using Gabor filters,” Pattern Recognition Letters, vol. 23, no. 9, pp. 1095–1102, 2002.
[20] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] CASIA.v4 Iris Database, “CASIA.v4 Iris Database.”
[22] Zijing Zhao and Kumar Ajay, “An accurate iris segmentation framework under relaxed imaging constraints using total variation model,” in