UNSUPERVISED PRE-TRAINED, TEXTURE AWARE AND LIGHTWEIGHT MODEL FOR DEEP LEARNING BASED IRIS RECOGNITION UNDER LIMITED ANNOTATED DATA

Manashi Chakraborty †‡, Mayukh Roy ‡, Prabir Kumar Biswas, Pabitra Mitra

Indian Institute of Technology, Kharagpur, India

ABSTRACT

In this paper, we present a texture aware lightweight deep learning framework for iris recognition. Our contributions are primarily threefold. Firstly, to address the dearth of labelled iris data, we propose a reconstruction loss guided unsupervised pre-training stage followed by supervised refinement. This drives the network weights to focus on discriminative iris texture patterns. Next, we propose several texture aware improvements inside a Convolutional Neural Network to better leverage iris textures. Finally, we show that our systematic training and architectural choices enable us to design an efficient framework with up to 100× fewer parameters than contemporary deep learning baselines while achieving better recognition performance in within and cross dataset evaluations.

Index Terms: Iris Recognition, Deep Learning, CNN, Texture, Lightweight

1. INTRODUCTION

Over the last few years, iris biometrics has shown immense potential as a highly reliable basis for biometric recognition [1, 2, 3, 4, 5]. Iris textures are highly subject discriminative [2] and, because the iris is an internal organ of the eye, it is resilient to environmental perturbations and stable over time.

The initial works on iris recognition focused on designing traditional hand engineered features [1, 6, 7, 8]. Recent success over a variety of vision applications on natural images [9, 10] showcases the advantage of deep Convolutional Neural Networks (CNNs) over hand-crafted features. Inspired by the success of CNNs, the iris biometric community also started exploring deep learning, and an appreciable gain in performance [11, 12, 13] is observed compared to traditional methods. However, some intrinsic issues, such as the absence of large annotated datasets, explicit processing of texture information and lightweight architecture design, have hardly been addressed. In this paper, we address these concerns with several systematic modifications to conventional CNN training pipelines and architectural choices.

Handling Absence of Large Dataset:

CNNs are data greedy and usually require millions of annotated samples for fruitful training. This is not an issue for natural images, where datasets such as ImageNet [14] and MS-COCO [15] contain large volumes of annotated data. However, for iris biometrics, dataset sizes are usually limited to a few thousand samples. This shortcoming remains an open challenge for deep learning based iris biometrics research. In this paper, we address this problem with a two-stage training strategy. In the first stage, we pre-train a parameterized feature encoder, $E_\theta(\cdot)$, to capture iris texture signatures in an unsupervised training framework. In the second stage, $E_\theta(\cdot)$ acts as a feature extractor and is further refined along with a classification head, $C_\psi(\cdot)$. We show that the combined training framework provides a significant boost in performance compared to single-stage training. Further, visualization with Layer-wise Relevance Propagation [16] shows that, as opposed to single-stage training, our proposed stage-wise training drives the network weights to focus more on the iris textures. This further motivated us to design the systematic texture attentive architectural choices described below.

† All correspondence to: [email protected]
‡ Denotes equal contribution

Energy Aware Pooling:

Non-parametric spatial sub-sampling (usually realised as max-pooling) is a crucial component of conventional deep networks, commonly used to retain the maximum response of a specified window. In this paper, we show that on a texture-rich iris dataset [2], sub-sampling using Energy Aware Pooling (EAP) is a better alternative to the max(·) operation.

Texture Energy Layer:

In deep networks, it is common practice to place several fully-connected layers at the end to amalgamate global structure information. However, iris images are mainly rich in local textures. To this end, we propose a Texture Energy Layer (TEL) that specifically captures the energy of the last convolutional filter bank responses. Such energy based features have traditionally been used for texture classification [17, 18, 19].

Light-weight Model for Inference:

The systematic design strategies enable us to operate with a much shallower architecture yet achieve better performance than the deeper baselines. Additionally, the TEL layer obviates the computationally heavy penultimate fully-connected layer of our proposed base architecture. As a consequence, our model has a significantly lower parameter count. This is particularly important since iris biometrics is gradually becoming an integral component of many handheld mobile devices.

Fig. 1. Stage-wise training framework of the proposed $CombNet^{EAP+TEL}_{E_\theta}$.

Our proposed architectural choices consistently outperform traditional as well as recent deep nets by a noteworthy margin. Even when the target dataset differs from the training data, our proposed model generalises with better performance without any fine-tuning on the target data.

2. RELATED WORK

Initial attempts at iris recognition were primarily inclined towards traditional techniques of extracting features from various filter bank responses. Daugman [1] extracted representative iris features from the responses of 2-D Gabor filters. Masek et al. [6] extracted responses from 1-D Log-Gabor filters. Ma et al. [8] proposed a bank of circularly symmetric sinusoidal modulated Gaussian filters to capture the discriminative iris textures. Wildes et al. [5] extracted discriminative iris textures from a multi-scale Laplacian of Gaussian (LoG). Monro et al. [7] used features from the Discrete Cosine Transform (DCT). To summarize, the earlier works mainly focused on hand-crafted feature representations.

Initial attempts [11, 13] at leveraging deep learning for iris recognition involved feature extraction using well known neural networks pre-trained for ImageNet classification, followed by a supervised classification stage. Recently, Gangwar et al. [12] proposed DeepIrisNet, an end-to-end deep neural network trained from scratch, which achieved an appreciable boost over the traditional methods.

3. METHODOLOGY

3.1. Network Architecture

Stage-1:

In the first phase, we follow an unsupervised framework for pre-training a feature encoder, $E_\theta(\cdot)$, to capture texture signatures (the subscript $\theta$ refers to the set of trainable parameters). For this, we train a convolutional auto-encoder with reconstruction loss, $L_R$. Specifically, given a normalised iris image, $I$ (an example of a normalised iris image is shown in Figure 1), we project it to a smaller resolution (by strided convolution and spatial sub-sampling) using the encoder and then decode it back to the original resolution with a decoder, $D_\phi(\cdot)$. The configurations of the various layers of the encoder, $E_\theta(\cdot)$, and decoder, $D_\phi(\cdot)$, are shown in Table 1. $L_R$ is thus applied between the original image, $I$, and the reconstructed image, $\hat{I} = D_\phi(E_\theta(I))$. In this paper, we use the Structural Similarity (SSIM) metric as a proxy for gauging the similarity between the original and reconstructed images. So, we minimise the following:

$$L_R = 1 - \mathrm{SSIM}(I, D_\phi(E_\theta(I))). \tag{1}$$

Fig. 2. Relevance map (red is most important while blue is least) of three different irises corresponding to three classes of the CASIA.v4-Distance dataset. Row 1: Normalised iris image. Row 2: Relevance map of $CombNet_R$ (randomly initialised encoder). Row 3: Relevance map of $CombNet_{E_\theta}$ (initialised with pre-trained encoder).

Stage-2 (CombNet):

In the second stage, the activations of $E_\theta(\cdot)$ are passed to the classification branch, $C_\psi(\cdot)$. Following the usual trend, the baseline $C_\psi(\cdot)$ consists of two fully connected layers followed by a softmax activation layer that outputs class probabilities. The combination $(E_\theta(\cdot), C_\psi(\cdot))$ is optimised using the cross entropy loss. We term this combined architecture CombNet. We define $CombNet_{E_\theta}$ as the combined model whose encoder, $E_\theta(\cdot)$, is pre-trained with the reconstruction loss from Stage-1. $CombNet_R$ is the CombNet model in which the encoder is randomly initialised (without any pre-training).
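As a rough illustration of the reconstruction loss in Eq. (1), the sketch below computes 1 − SSIM between an image and its reconstruction. It is a simplified version under stated assumptions: SSIM is evaluated once over the whole image rather than over local windows, the stabilising constants use the customary K1 = 0.01 and K2 = 0.03, pixel values are assumed to lie in [0, 1], and the names `global_ssim` and `reconstruction_loss` are hypothetical. The `reconstruction` argument stands in for the decoder output; no encoder or decoder is modelled here.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized flat lists of pixels."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabiliser for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabiliser for the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def reconstruction_loss(image, reconstruction):
    """Loss of Eq. (1): one minus the similarity of image and reconstruction."""
    return 1.0 - global_ssim(image, reconstruction)


# Toy 2 x 2 "images", flattened row by row (illustrative values only).
original = [0.1, 0.5, 0.9, 0.3]
perfect = list(original)
poor = [0.9, 0.5, 0.1, 0.7]

loss_perfect = reconstruction_loss(original, perfect)  # essentially zero
loss_poor = reconstruction_loss(original, poor)        # clearly larger
```

A perfect reconstruction drives the loss to zero, while a structurally different one is penalised, which is the behaviour the pre-training stage relies on.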

Energy Aware Pooling (EAP):

This layer is proposed to retain the local texture energy during spatial sub-sampling in a CNN. The de facto choice for sub-sampling in CNNs is max-pooling, which is more appropriate for determining the presence or absence of a particular feature over the sampled window. For iris images, which have local textural patterns, it is more prudent to retain the energy of the sub-sampled window. With this in mind, for a pooling kernel of receptive field $k \times k$, EAP calculates the average of the $k^2$ pixels instead of finding the maximum as in the max-pool operation. Downsampling is achieved by operating this kernel with a stride of 2 pixels. This way of retaining energy while downsampling is closely analogous to the energy of filter bank responses that has traditionally been used as a discriminative feature for texture classification [17, 19]. We term the model with the proposed EAP layer $CombNet^{EAP}_{E_\theta}$.

Texture Energy Layer (TEL):

This layer is designed to alleviate the need for the penultimate fully connected layer of $CombNet^{EAP}_{E_\theta}$. This computationally heavy fully connected layer has the entire image as its receptive field and thus loses local textures, which are more important for iris recognition. Therefore, in this stage $CombNet^{EAP}_{E_\theta}$ is made more texture attentive by adding TEL after the last convolution layer. In this layer we use spatial averaging kernels with spatial support equal to the dimension of the feature maps from the previous layer. So, if the input to the TEL layer is $H \times W \times C$, its output is $1 \times 1 \times C$. These stacked average values closely correspond to the energy of each activation map of the previous layer. The output of TEL is then finally passed to a single fully connected layer, followed by a softmax activation to get the final class probabilities. This combined texture attentive model having both EAP and TEL layers is termed $CombNet^{EAP+TEL}_{E_\theta}$ and is shown in Figure 1. As TEL alleviates the need for the penultimate fully connected layer, it dramatically reduces the parameter count (46.72× cheaper) relative to our baseline having two fully connected layers, as reported in Table 2.

Table 1. Configurations of various layers of $E_\theta(\cdot)$ and $D_\phi(\cdot)$.

Block     Type                 Kernel   Stride   Padding   Output Channels
Encoder   Conv                 [numeric entries of the encoder rows are missing]
Decoder   Pixel Shuffle [20]                               64
Decoder   Pixel Shuffle [20]                               16
Decoder   Pixel Shuffle [20]                               4
Decoder   Pixel Shuffle [20]                               1
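The EAP and TEL operations described in this section can be sketched in a few lines. The code below is only an illustration, not the authors' implementation: feature maps are plain nested lists, the function names are hypothetical, and the default kernel size k = 2 is an assumption (the text fixes the stride at 2 but does not state k). A max-pool is included purely for comparison.

```python
def energy_aware_pool(fmap, k=2, stride=2):
    """EAP sketch: average each k x k window of a 2-D feature map."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            window = [fmap[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(sum(window) / (k * k))  # mean of the k^2 pixels
        pooled.append(row)
    return pooled


def max_pool(fmap, k=2, stride=2):
    """Conventional max-pooling over the same windows, for comparison."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
            for j in range(0, w - k + 1, stride)
        ]
        for i in range(0, h - k + 1, stride)
    ]


def texture_energy_layer(fmaps):
    """TEL sketch: reduce C maps of size H x W to a length-C vector of means."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in fmaps]


# One 4 x 4 activation map with illustrative values.
activation = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

eap_out = energy_aware_pool(activation)   # 2 x 2 map of window means
max_out = max_pool(activation)            # 2 x 2 map of window maxima
tel_out = texture_energy_layer([activation, eap_out])  # one value per map
```

In this toy case EAP keeps the mean of every window, whereas max-pooling keeps only the largest response, and TEL collapses each map to a single value, so C maps yield a C-dimensional descriptor regardless of H and W.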

3.2. Matching Framework

Representative iris signatures (1024-D) are extracted from the TEL layer of $CombNet^{EAP+TEL}_{E_\theta}$. Two iris images are matched based on the dissimilarity score given by the normalised Euclidean distance between their respective iris signatures.
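A minimal sketch of such a matcher is given below. The text does not define "normalised Euclidean distance", so this example assumes one common reading: each signature is scaled to unit L2 norm before the Euclidean distance is taken, which bounds the score to [0, 2]. The function names, the toy 3-D signatures (the paper uses 1024-D) and the threshold value are all hypothetical.

```python
import math


def l2_normalise(signature):
    """Scale a signature to unit L2 norm (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in signature))
    if norm == 0.0:
        return list(signature)
    return [v / norm for v in signature]


def dissimilarity(sig_a, sig_b):
    """Euclidean distance between L2-normalised signatures, in [0, 2]."""
    a = l2_normalise(sig_a)
    b = l2_normalise(sig_b)
    return math.sqrt(sum((x - y) * (x - y) for x, y in zip(a, b)))


def is_match(sig_a, sig_b, threshold=0.5):
    """Accept the pair when the score is at most an assumed threshold."""
    return dissimilarity(sig_a, sig_b) <= threshold


# Toy signatures: the first two differ only in scale, the third is orthogonal.
enrolled = [1.0, 2.0, 2.0]
probe_same = [2.0, 4.0, 4.0]
probe_other = [2.0, -1.0, 0.0]

score_same = dissimilarity(enrolled, probe_same)
score_other = dissimilarity(enrolled, probe_other)
```

In practice the acceptance threshold would be chosen from the DET curve rather than fixed in advance.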

4. EXPERIMENTS

4.1. Comparing Methods

We compare our proposed framework with three traditional baselines: Daugman [1], Masek [6] and Ma et al. [8]. From the deep learning paradigm, we compare against a VGG-16 pre-trained on ImageNet and fine-tuned on the iris dataset. This was one of the initial attempts at applying transfer learning with deep neural nets to iris data [11, 13]. We also compare against DeepIrisNet [12], which is a much deeper model having 8 convolution and 3 fully connected layers.

We present our results on CASIA.v4-Distance [21] and CASIA.v4-Thousand [21]. The irises of the left and right eyes have disparate patterns [2] and are thus attributed to different classes, i.e., the number of classes is twice the number of subjects present in the dataset.

The framework of [22] is used for iris segmentation and normalization. Normalised irises of three different subjects of the CASIA.v4-Distance dataset are shown in Figure 2. The spatial resolution of normalised iris images for all experiments is 512 × 64 unless stated otherwise. For fair comparison, the same segmentation and normalization protocol is followed for all experiments. We used the following two dataset configurations for performance evaluation.

Within Dataset:

Here, the 'training+validation' and test splits are selected from the CASIA.v4-Distance dataset [21], which has 142 subjects. Experiments were conducted on 4773 samples from 284 classes (left and right irises are considered different classes). Out of these 284 classes, the 'training+validation' split comprises 80% of the classes and the remaining disjoint 20% forms the test split used for reporting verification results (using the matching framework of Section 3.2).

Cross Dataset:

In this setting, all the pre-trained models (trained on CASIA.v4-Distance) were used directly on the CASIA.v4-Thousand dataset without any fine-tuning. This challenging configuration therefore evaluates the generalization capability of the competing deep learning frameworks. CASIA.v4-Thousand has 2000 classes (left and right irises belong to different classes). We perform 5-fold testing. Each fold consists of 1/5th of the total classes. The average matching performance over the 5 folds is reported.

Following the matching framework of [12], the test set for both of the above configurations is divided into a gallery (enrolled images) set and a probe (query) set. 50% of the identities in the probe set are impostors (identities not enrolled in the system) while the rest are genuine identities.

Exp 1 - Self ablation of architectural choices:

In this section, we perform self ablation of variants of architectural choices. We use classification accuracy on the validation subset of the 'training+validation' split as the metric for model selection. Metrics are reported in Table 2.

Table 2. Self ablation of various architectural choices.

Model                           Classification Accuracy (in %)   Parameters
$CombNet_{E_\theta}$            60.53                            [missing]
$CombNet^{EAP}_{E_\theta}$      74.09                            [missing]
$CombNet^{EAP+TEL}_{E_\theta}$  [missing]                        [missing]

a) Benefit of Stage-wise Training:

The classification accuracy of $CombNet_{E_\theta}$ is higher than that of $CombNet_R$. This clearly shows the benefit of pre-training the encoder part of CombNet over a randomly initialised encoder ($CombNet_R$). Further, to explain the superiority of $CombNet_{E_\theta}$ over $CombNet_R$, we study the relevance map of a given iris image correctly classified by both models. A relevance map indicates which input pixels were important for classification. Figure 2 shows the relevance (heat) maps of both models for three different classes of the CASIA.v4-Distance dataset. It is evident from the figure that pre-training the encoder encourages $CombNet_{E_\theta}$ to focus more on the texture patterns, as opposed to $CombNet_R$, which primarily concentrates on the overall shape cues obtained from the boundary pixels (separating the iris region from the background). Motivated by this observation, we incorporate additional improvements into $CombNet_{E_\theta}$ that further exploit the textural cues for better performance.

b) Benefit of EAP and TEL layers:

From Table 2 we observe that when the max-pool layer is replaced by EAP, classification accuracy increases from 60.53% to 74.09%. This bolsters our assumption that the EAP layer is more beneficial for sub-sampling than max-pool on texture-rich images. On replacing the penultimate fully connected layer of $CombNet^{EAP}_{E_\theta}$ with the TEL layer, we see a further improvement in performance with our $CombNet^{EAP+TEL}_{E_\theta}$ model.

Table 3. Comparison on CASIA.v4-Distance (within dataset configuration).

Category      Model                                      EER (in %)   AUC     Parameters (in millions)
Traditional   Masek [6]                                  5.70         0.030   XXX
Traditional   Li Ma et al. [8]                           5.45         0.026   XXX
Traditional   Daugman [1]                                5.20         0.015   XXX
Deep Nets     VGG-16                                     4.88         0.012   135.2
Deep Nets     DeepIrisNet [12]                           4.80         0.011   291.2
Deep Nets     $CombNet^{EAP+TEL}_{E_\theta}$ (Proposed)  3.25         0.004   2.9

Exp 2 - Within and Cross dataset comparison of our preferred architecture with existing methods:

From Exp 1, it is evident that $CombNet^{EAP+TEL}_{E_\theta}$ outperforms our other architectural choices. Therefore, in this phase we compare our best architectural choice with existing traditional as well as deep learning models. Performance is evaluated using the EER (Equal Error Rate) and the AUC (Area Under the Curve) of the Detection Error Tradeoff (DET) curve. We also report the parameter counts of the competing deep nets as a measure of computational complexity. Only the test sets (of the within and cross dataset configurations) of both datasets are used for reporting iris verification performance.

(a.) Within Dataset:

First, we compare the efficacy of our proposed $CombNet^{EAP+TEL}_{E_\theta}$ with three traditional baselines: Daugman [1], Masek [6] and Ma et al. [8]. Across both metrics reported in Table 3, our proposed framework outperforms all three baselines by notable margins. Next, we compare with the recent deep learning frameworks. We initially compare against a VGG-16 pre-trained on ImageNet and fine-tuned on the CASIA.v4-Distance dataset, similar to the work done by [11, 13]. The normalised iris images input to the VGG-16 framework have a different resolution from those used for the other models. Though fine-tuning a pre-trained (on ImageNet) VGG-16 performs better than the traditional methods, $CombNet^{EAP+TEL}_{E_\theta}$ proves superior to it. This can be primarily attributed to the fact that the kernels of VGG-16 were trained to learn the structure and shape cues present in natural images and not the texture-rich content prevalent in iris images. Thus, naively applying transfer learning across such disparate domains is sub-optimal. From Table 3, we also observe that our proposed shallow $CombNet^{EAP+TEL}_{E_\theta}$ performs better than DeepIrisNet [12]. This boost is primarily because of our systematic design choices. As argued before, our stage-wise training compels the network to focus more on discriminating iris textures, which is further improved by incorporating the EAP and TEL layers. Also, for an iris dataset with a paucity of annotated labels, it is more prudent to use less complex models (in terms of parameter count) than deeper counterparts. Both DeepIrisNet and the fine-tuned VGG-16 have much deeper and more complex architectures for limited annotated iris datasets, and thus our model consistently outperforms them. Figure 3 depicts the DET curves of all the competing models of this phase.

Table 4. Comparison on CASIA.v4-Thousand (cross dataset configuration).

Model                                      EER (in %)   AUC
DeepIrisNet                                6.6          0.033
VGG-16                                     6.6          0.028
$CombNet^{EAP+TEL}_{E_\theta}$ (Proposed)  5.3          0.018

Fig. 3. DET curves. Left: comparing traditional and deep learning methods on CASIA.v4-Distance (Within Dataset). Right: comparing deep learning methods on CASIA.v4-Thousand (Cross Dataset).

(b.) Cross Dataset:

From Table 4, it is evident that even in such a challenging scenario, our proposed framework performs better than the competing deep networks. This indicates that our proposed framework generalizes better than the other deep learning frameworks. Figure 3 depicts the DET curves of the competing deep nets for one randomly selected fold. For fairness, the same fold is chosen for all the compared models.
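For readers unfamiliar with the EER metric used in Tables 3 and 4, the sketch below shows one simple way to estimate it from genuine and impostor dissimilarity scores. It is an illustrative threshold sweep, not the evaluation code behind the reported numbers: the scores are made up, the function name is hypothetical, and a pair is assumed to be accepted when its score is at most the threshold.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) where false accept and false reject rates meet.

    Scores are dissimilarities: lower means more likely the same iris.
    """
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in candidates:
        # Genuine pairs scoring above the threshold are wrongly rejected.
        frr = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
        # Impostor pairs scoring at or below the threshold are wrongly accepted.
        far = sum(s <= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
            best_threshold = threshold
    return best_eer, best_threshold


# Made-up scores with one overlapping genuine/impostor pair.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]

eer, eer_threshold = equal_error_rate(genuine, impostor)
```

Each threshold in the sweep gives one (false accept, false reject) pair, which is also one point on a DET curve such as those in Figure 3.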

Reduction of Parameters:

There is an increased demand to run biometric systems on mobile devices, so lightweight models are favoured for inference. In Table 2, we compare the number of parameters of our different architectural choices. We see that replacing the fully connected layers of $CombNet_{E_\theta}$ with the TEL layer in $CombNet^{EAP+TEL}_{E_\theta}$ results in a 46.72× reduction in parameters. From Table 3, it can be observed that, compared to VGG-16 and DeepIrisNet [12], our model, $CombNet^{EAP+TEL}_{E_\theta}$, is respectively 46.62× and 100.41× cheaper in terms of parameters, yet its performance is better. Note that the input to VGG-16 is a normalised iris image whose dimensions differ from the 512 × 64 input used by all other models.
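The two reduction factors quoted from Table 3 follow directly from the parameter counts listed there, as the short check below shows. The dictionary keys are shortened labels chosen for this example, and the counts are assumed to be in millions (the ratios do not depend on the unit).

```python
# Parameter counts as listed in Table 3 (assumed to be in millions).
parameter_counts = {
    "VGG-16": 135.2,
    "DeepIrisNet": 291.2,
    "Proposed": 2.9,
}


def reduction_factor(baseline_params, proposed_params):
    """How many times fewer parameters the proposed model has."""
    return baseline_params / proposed_params


reductions = {
    name: round(reduction_factor(count, parameter_counts["Proposed"]), 2)
    for name, count in parameter_counts.items()
    if name != "Proposed"
}
```

Rounded to two decimals, the ratios reproduce the 46.62× and 100.41× figures given in the text.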

5. CONCLUSION

This paper proposes stage-wise texture aware training strategies for building a reliable iris verification system under limited annotated data. It showcases the benefits of unsupervised auto-encoder based pre-training as a good weight initializer for training networks with less data. Further, the proposed EAP and TEL layers are shown to leverage the local texture patterns of iris images. Our final framework is significantly lightweight and consistently outperforms competing baselines in within and cross dataset evaluations. Motivated by the success of auto-encoder based pre-training, in future we wish to study the benefits of other recent generative models.

6. REFERENCES

[1] J. Daugman, "How iris recognition works," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 21–30, Jan 2004.
[2] John G Daugman, "High confidence visual recognition of persons by a test of statistical independence," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1148–1161, 1993.
[3] D de Martin-Roche, Carmen Sanchez-Avila, and Raul Sanchez-Reillo, "Iris recognition for biometric identification using dyadic wavelet transform zero-crossing," in Proceedings IEEE 35th Annual 2001 International Carnahan Conference on Security Technology (Cat. No. 01CH37186). IEEE, 2001, pp. 272–277.
[4] Richard P Wildes, Jane C Asmuth, Gilbert L Green, Steven C Hsu, Raymond J Kolczynski, James R Matey, and Sterling E McBride, "A machine-vision system for iris recognition," Machine Vision and Applications, vol. 9, no. 1, pp. 1–8, 1996.
[5] Richard P Wildes, "Iris recognition: an emerging biometric technology," Proceedings of the IEEE, vol. 85, no. 9, pp. 1348–1363, 1997.
[6] Libor Masek et al., Recognition of human iris patterns for biometric identification, Ph.D. thesis, Masters thesis, University of Western Australia, 2003.
[7] Donald M Monro, Soumyadip Rakshit, and Dexin Zhang, "DCT-based iris recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 586–595, 2007.
[8] Li Ma, Tieniu Tan, Yunhong Wang, and Dexin Zhang, "Personal identification based on iris texture analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1519–1533, 2003.
[9] Ross Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] Shervin Minaee, Amirali Abdolrashidiy, and Yao Wang, "An experimental study of deep convolutional features for iris recognition," IEEE, 2016, pp. 1–6.
[12] Abhishek Gangwar and Akanksha Joshi, "DeepIrisNet: Deep iris representation with applications in iris recognition and cross-sensor iris recognition," IEEE, 2016, pp. 2301–2305.
[13] Kien Nguyen, Clinton Fookes, Arun Ross, and Sridha Sridharan, "Iris recognition with off-the-shelf CNN features: A deep learning perspective," IEEE Access, vol. 6, pp. 18848–18855, 2017.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," IEEE, 2009, pp. 248–255.
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[16] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PloS ONE, vol. 10, no. 7, 2015.
[17] Ju Han and Kai-Kuang Ma, "Rotation-invariant and scale-invariant Gabor features for texture image retrieval," Image and Vision Computing, vol. 25, no. 9, pp. 1474–1481, 2007.
[18] Michael Unser, "Texture classification and segmentation using wavelet frames," IEEE Transactions on Image Processing, vol. 4, no. 11, pp. 1549–1560, 1995.
[19] Mahamadou Idrissa and Marc Acheroy, "Texture classification using Gabor filters," Pattern Recognition Letters, vol. 23, no. 9, pp. 1095–1102, 2002.
[20] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] CASIA.v4 Iris Database, "CASIA.v4 Iris Database."
[22] Zijing Zhao and Kumar Ajay, "An accurate iris segmentation framework under relaxed imaging constraints using total variation model," in