SqueezeFacePoseNet: Lightweight Face Verification Across Different Poses for Mobile Platforms
Fernando Alonso-Fernandez (a), Javier Barrachina (b), Kevin Hernandez-Diaz (a), Josef Bigun (a)

(a) Center for Applied Intelligent Systems Research (CAISR), Halmstad University, 30118 Halmstad, Sweden
(b) Facephi Biometria, Avenida México 20, Edificio MARSAMAR, 03008 Alicante, Spain

Emails: [email protected], [email protected], [email protected], [email protected]
Abstract—Virtual applications through mobile platforms are one of the most important and ever-growing fields in AI today, where ubiquitous and real-time person authentication has become critical after the breakthrough of all kinds of services provided via mobile devices. In this context, face verification technologies can provide reliable and robust user authentication, given the availability of cameras in these devices, as well as their widespread use in everyday applications. The rapid development of deep Convolutional Neural Networks (CNNs) has resulted in many accurate face verification architectures. However, their typical size (hundreds of megabytes) makes them infeasible to incorporate in downloadable mobile applications, where the entire file typically may not exceed 100 MB. Accordingly, we address the challenge of developing a lightweight face recognition network of just a few megabytes that can operate with sufficient accuracy in comparison to much larger models. The network should also be able to operate under different poses, given the variability naturally observed in uncontrolled environments where mobile devices are typically used. In this paper, we adapt the lightweight SqueezeNet model, of just 4.4MB, to effectively provide cross-pose face recognition. After training on the MS-Celeb-1M and VGGFace2 databases, our model achieves an EER of 1.23% on the difficult frontal vs. profile comparison, and 0.54% on profile vs. profile images. Under less extreme variations involving frontal images in any of the enrolment/query image pairs, EER is pushed down to <1%.

I. INTRODUCTION
All kinds of services are migrating from physical to digital domains. Mobiles have become data hubs, storing sensitive data like payment information, photos, emails or passwords [1]. In this context, biometric technologies hold great promise to provide reliable and robust user authentication using the sensors embedded in such devices [2]. But in order for algorithms to operate with sufficient accuracy, they need to be adapted to the limited processing resources of mobile devices. Data templates also have to be small if they are to be transmitted. On top of that, mobile environments usually imply little control over the acquisition (e.g. on-the-move or on-the-go), leading to huge variability in data quality.

In this work, we are interested in face technologies in mobile environments. Face verification is increasingly used in applications such as device unlock, mobile payments, login to applications, etc. Recent developments involve deep learning [3]. Given enough data, these methods generate classifiers with impressive performance in unconstrained scenarios with high variability. However, state-of-the-art solutions are built upon big deep Convolutional Neural Networks (CNNs), e.g. [4], with dozens of millions of parameters and models that typically occupy hundreds of megabytes. Such a big size and the computational resources that such networks require make them unfeasible for embedded mobile applications.

In recent years, lighter CNNs have been proposed for common visual tasks, e.g. MobileNetV1 [5], MobileNetV2 [6], ShuffleNet [7] or SqueezeNet [8]. These provide lighter architectures and faster processing times. Several works have benchmarked some of these networks for face recognition [9]-[11]. Even if they employ training databases that contain images captured under a wide range of variations, they have not specifically assessed face recognition performance across different poses. In this work, our main contribution is therefore a novel lightweight face recognition network which is tested against a database specifically designed to explore pose variations [12]. We base our developments on SqueezeNet, which is a much lighter architecture than the other networks. To the best of our knowledge, this is the first work testing deep face recognition performance specifically under different poses and in mobile environments. With a database of 11040 images from 368 subjects captured with different poses, our experiments show that the proposed network compares well against two larger benchmark networks having a size >30 times bigger and >20 times more parameters.

II. RELATED WORKS
Lightweight CNNs employ different techniques to achieve fewer parameters and faster processing, such as point-wise convolution, depth-wise separable convolution to replace the vanilla convolution, and bottleneck layers. Point-wise convolutions consist of 1×1 kernels. With 3×3 kernels, depth-wise separable convolution achieves a computational reduction of 8-9 times in comparison to standard convolution, with a small cost in accuracy only [5]. Bottleneck layers consist of obtaining a representation of the input with reduced dimensionality before processing it with a larger amount of filters that usually have bigger spatial dimensions as well.
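To make the saving concrete, the following minimal PyTorch sketch (ours, for illustration; not code from the paper) contrasts the weight counts of a standard 3×3 convolution with its depth-wise separable counterpart:

```python
import torch.nn as nn

cin, cout = 64, 128

# Standard 3x3 convolution: cin * cout * 3 * 3 weights (plus biases).
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

# Depth-wise separable: a per-channel 3x3 convolution (groups=cin),
# followed by a 1x1 point-wise convolution that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depth-wise
    nn.Conv2d(cin, cout, kernel_size=1),                        # point-wise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # ~73.9k vs ~9.0k, roughly 8x fewer
```

The ratio grows with the number of output channels, which is why the reduction quoted above sits in the 8-9x range for 3×3 kernels.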
Network               Input size   Layers   Model Size   Parameters   Vector Size   Inference Time
Existing literature:
  LightCNN [13]       128×128      29       n/a          12.6M        256           n/a
  MobileFaceNets [9]  112×112      50       4MB          0.99M        256           24ms (*)
  MobiFace [10]       112×112      45       11.3MB       n/a          512           28ms (*)
  ShuffleFaceNet [11] 112×112      n/a      10.5MB       2.6M         128           29.1ms (*)
  SeesawFaceNets [14] 112×112      50       n/a          1.3M         512           n/a
This paper:
  SqueezeFacePoseNet  113×113      18       4.41MB       1.24M        1000          37.7ms
   +GDC               113×113      18       5.01MB       1.4M         1000          38.7ms
   +DWC               113×113      18       2.5MB        0.69M        1000          36.4ms
   +DWC+GDC           113×113      18       3.1MB        0.86M        1000          36.9ms
  ResNet50ft [12]     224×224      50       146MB        25.6M        2048          0.16s
  SENet50ft [12]      224×224      50       155MB        28.1M        2048          0.21s
TABLE I: Top: lightweight models proposed in the literature for face recognition. Bottom: networks evaluated in the present paper. (*) Inference times are as reported in the respective papers, so they are not fully comparable. The hardware used in the reported studies includes a Qualcomm Snapdragon 820 mobile CPU @ 2.2 GHz [9], an Intel i7-6850K CPU @ 3.6 GHz [10], and an Intel i7-7700HQ CPU @ 2.80 GHz [11]. The latter also carries out a comparison of different devices, including high-end GPUs, with inference times reduced by around one order of magnitude. Please refer to the original papers for details. Inference in this paper is done with an Intel i7-8650U CPU @ 1.9 GHz.

SqueezeNet is one of the early works that focused on an architecture with fewer parameters and a smaller size (1.24M parameters, 4.6MB, and 18 convolutional layers). The authors proposed to replace most 3×3 filters with 1×1 filters, which use nine times fewer parameters.

III. NETWORK ARCHITECTURE
As back-bone model, we employ SqueezeNet [8]. This is the smallest architecture among the generic light CNNs mentioned. With only 1.24M parameters and 4.6 MB in its uncompressed version, it matched AlexNet accuracy on ImageNet with 50x fewer parameters. The building brick of SqueezeNet, called fire module (Figure 1), contains two layers: a squeeze layer and an expand layer. The squeeze layer uses 1×1 filters, while the expand layer uses a mix of 1×1 and 3×3 filters; part of the expand filters are of size 1×1, instead of 3×3, to achieve further parameter reduction. The squeezing (bottleneck) and expansion behavior is common in CNNs, and it helps to reduce the amount of parameters, while keeping the same feature map size between the input and output [6]. In addition, SqueezeNet uses late downsampling, so many convolution layers have large activation maps. Intuitively, this should lead to a higher accuracy.
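As an illustration, the fire module just described can be sketched in a few lines of PyTorch (our minimal re-implementation for clarity, mirroring the SqueezeNet building block of [8] rather than the authors' code; the layer sizes follow the fire2 row of Table II):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: 1x1 squeeze, then parallel 1x1/3x3 expand."""
    def __init__(self, in_ch, squeeze, expand1x1, expand3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))  # bottleneck: reduce channels first
        # expand with a mix of cheap 1x1 filters and spatial 3x3 filters
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# fire2 of Table II: 64 input channels -> 16 squeeze -> 64+64 = 128 output
fire2 = Fire(64, squeeze=16, expand1x1=64, expand3x3=64)
out = fire2(torch.randn(1, 64, 56, 56))  # -> shape (1, 128, 56, 56)
```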
The architecture of the employed network is shown in Table II, which mirrors the one of [8] with slight changes. The network has been modified to employ an input size of 113×113×3. It starts with a convolutional layer with 64 filters of size 3×3×3, followed by a max-pooling layer and eight fire modules, interleaved with two further max-pooling layers and a dropout layer. A final convolutional layer (conv10) with C filters of size 1×1 produces the class scores, followed by global average pooling (GAP) and a softmax, where C is the number of classes of the training set. As indicated in Table I, we also evaluate variants of this network in which the GAP layer is replaced by Global Depth-wise Convolution (GDC), and in which standard convolutions are replaced by depth-wise separable convolutions (DWC).
Layer       Output size   squeeze 1×1   expand 1×1   expand 3×3
conv1       113×113×64    -             -            -
maxpool1    56×56×64      -             -            -
fire2       56×56×128     16            64           64
fire3       56×56×128     16            64           64
fire4       56×56×256     32            128          128
maxpool4    27×27×256     -             -            -
fire5       27×27×256     32            128          128
fire6       27×27×384     48            192          192
fire7       27×27×384     48            192          192
fire8       27×27×512     64            256          256
maxpool8    13×13×512     -             -            -
fire9       13×13×512     64            256          256
dropout9    13×13×512     -             -            -
conv10      13×13×C       -             -            -
avgpool10   1×1×C         -             -            -
softmax     1×C           -             -            -
TABLE II: Architecture of the employed network. C is the number of classes of the training set.

As baseline, we employ the ResNet50 [4] and SENet50 [16] architectures, whose weights are initialized from scratch, then trained on the MS-Celeb-1M [17] dataset, and further fine-tuned on the VGGFace2 dataset (models available at https://github.com/ox-vgg/vgg_face2). We will refer to these as ResNet50ft and SENet50ft.

IV. DATABASE AND EXPERIMENTAL PROTOCOL
We use the VGGFace2 dataset, with 3.31M images of 9131 celebrities, and an average of 363.6 images per person [12]. The images, downloaded from the Internet, show large variations in pose, age, ethnicity, lighting and background. The database is divided into 8631 training classes (3.14M images), with the remaining 500 classes for testing. To enable recognition across different poses, a subset of 368 subjects from the test set is provided (VGGFace2-Pose for short), with 10 images per pose (frontal, three-quarter, and profile), totalling 11040 images. To further improve the recognition performance of our mobile network, we also use the RetinaFace cleaned set of the MS-Celeb-1M database [17] to pre-train our model (MS1M for short). Its face images are pre-processed to a size of 112×112.

            SAME-POSE                                       CROSS-POSE
template    genuine                   impostor              genuine             impostor
1 image     368×(9+8+...+1) = 16560   368×100 = 36800       368×10×10 = 36800   368×100 = 36800
5 images    368×1 = 368               368×100 = 36800       368×2×2 = 1472      368×100 = 36800
TABLE III: Number of biometric verification scores.

Fig. 2: Example images of the databases employed. (a) VGGFace2 pose templates from three viewpoints (frontal, three-quarter, and profile, arranged by row); image from [12]. (b) VGGFace2 training images with random crop. (c) MS-Celeb-1M images from three users (by row) and three poses (by column): frontal (columns 1-2), three-quarter (3-4), and profile (5).

For our network, VGGFace2 images are resized, so the shorter side has 256 pixels; a 224×224 crop of the center is then done (instead of a random crop), followed by a resize to 113×113. As descriptor, we use the 1000-element output vector of our network (Table I). A distance measure (χ² in our case) is then used to obtain the similarity between two templates. With the ResNet50ft and SENet50ft architectures, we use as descriptor the output of the layer adjacent to the classification layer, with dimensionality 2048. Also, ResNet50ft and SENet50ft employ input images of 224×224.
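As a sketch of the matching stage (our illustration, not the authors' code: the source does not spell out how the descriptors of several images are fused into a template, so we assume simple averaging; `backbone` stands for any descriptor extractor of Table I returning a non-negative vector, e.g. post-ReLU):

```python
import numpy as np

def make_template(backbone, images):
    """Build a user template from one or five face images by averaging
    their descriptors (averaging is our assumption, see text)."""
    descriptors = np.stack([backbone(img) for img in images])
    template = descriptors.mean(axis=0)
    return template / (template.sum() + 1e-12)  # L1-normalize

def chi2_distance(t1, t2, eps=1e-12):
    """Chi-squared distance between templates; lower = more similar.
    Assumes non-negative, normalized descriptors."""
    return 0.5 * np.sum((t1 - t2) ** 2 / (t1 + t2 + eps))
```

A verification decision is then taken by thresholding this distance.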
V. RESULTS

A. Same-Pose Comparisons
We first report experiments of same-pose comparisons, i.e. comparing only templates generated with images having the same pose (Figure 3, top). Genuine trials are done by comparing each template of a user to the remaining templates of the same user, avoiding symmetric comparisons. Concerning impostor experiments, the first template of a user is used as enrolment template, and compared with the second template of the next 100 users. Table III (left) shows the total number of scores with this protocol; the trial generation itself is sketched in code below. Recall that when templates are generated using 5 images, there are only two templates available per user and per pose. On the other hand, when templates are generated with only one image, there are ten templates per user and per pose. Face verification results following this protocol are given in Figures 4 and 5. Also, Table IV, left, shows the EER values of the same-pose experiments.

Fig. 4: SqueezeFacePoseNet: Face verification results (same-pose comparisons). Better in colour.

Fig. 5: ResNet50ft and SENet50ft (same-pose comparisons). Better in colour.

A first observation is that our SqueezeFacePoseNet model provides in general better results without the inclusion of Global Depth-wise Convolution (GDC). This is in contrast to some previous studies, where GDC is reported to provide better performance [9], [10]. It should be mentioned, though, that the authors of our baseline networks kept the GAP layer in the ResNet50ft and SENet50ft models [12]. One possible reason for these results is that in training with VGGFace2, the face region is randomly cropped from the detected bounding box [12], leading to images where faces are not aligned (Figure 2b). This may serve as an 'augmentation' strategy, making counterproductive the use of GDC to learn different weights for each spatial region, since faces are not spatially aligned during training.
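To make the protocol concrete, the trial generation can be sketched as follows (our illustration; `templates[user]` holds the templates of a user for the pose under test, ten with one-image templates or two with five-image templates, and the wrap-around at the end of the user list is our assumption):

```python
def genuine_pairs(templates):
    """Each template of a user vs. the remaining ones of the same user,
    avoiding symmetric repeats: 10 templates -> 9+8+...+1 = 45 trials."""
    for user_templates in templates.values():
        for i in range(len(user_templates)):
            for j in range(i + 1, len(user_templates)):
                yield user_templates[i], user_templates[j]

def impostor_pairs(templates, n_next=100):
    """First template of each user as enrolment vs. the second template
    of the next 100 users: 368 users -> 368 x 100 = 36800 trials."""
    users = list(templates)
    for idx, user in enumerate(users):
        for step in range(1, n_next + 1):
            other = users[(idx + step) % len(users)]
            yield templates[user][0], templates[other][1]
```

With 368 users and ten one-image templates each, these generators reproduce the same-pose counts of Table III (16560 genuine and 36800 impostor scores).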
Recognition           Same-Pose                 Cross-Pose
Network               F-F     3/4-3/4   P-P     F-3/4   3/4-P   F-P
SqueezeFacePoseNet    6.39    5.47      7.88    6.09    7.02    8.15
 +GDC                 8.67    7.18      9.18    8.06    9.00    10.59
 +DWC                 8.28    7.77      12.27   8.11    11.08   12.03
 +DWC+GDC             10.07   9.11      14.04   9.86    12.67   14.24
ResNet50ft            4.14    3.13      5.16    3.68    4.25    4.99
SENet50ft             …       …         …       …       …       …

(a) Template consisting of one face per user.

Recognition           Same-Pose                 Cross-Pose
Network               F-F     3/4-3/4   P-P     F-3/4   3/4-P   F-P
SqueezeFacePoseNet    0.27    0.06      0.54    0.20    0.88    1.23
 +GDC                 0.27    0.08      0.37    0.15    0.75    1.29
 +DWC                 0.39    0.54      1.11    0.47    1.98    2.85
 +DWC+GDC             0.81    0.61      1.63    0.68    1.82    3.39
ResNet50ft            …       …         …       …       …       …
SENet50ft             0.02    …         …       …       …       …

(b) Template consisting of five faces per user.
TABLE IV: Face verification results on the VGGFace2-Pose database (EER %). F=Frontal View. 3/4=Three-Quarter. P=Profile. The best result of each column is marked in bold.

The use of depth-wise separable convolution (DWC) in SqueezeFacePoseNet also results in a slight decrease of performance. This is to be expected [5], although it should be taken into account that adding DWC to our network reduces its model size by about 60% (Table I). Among all the networks evaluated, SENet50ft clearly stands out, especially when templates are generated with only one image (top part of Table IV), which is a much more adverse case than the combination of five images (bottom part). The superiority of SENet50ft over ResNet50ft for face recognition is also observed in the paper where they were presented [12], due to the inclusion of Squeeze-and-Excitation blocks. Regarding SqueezeFacePoseNet, its performance is comparatively worse. Even in that case, we believe that it obtains meritorious results, considering that it employs images of 113×113 (instead of 224×224), its model is >30 times smaller than ResNet50ft and SENet50ft, and it has >20 times fewer parameters.

Fig. 6: SqueezeFacePoseNet: Face verification results (cross-pose comparisons). Better in colour.

Fig. 7: ResNet50ft and SENet50ft (cross-pose comparisons). Better in colour.

The good results of SqueezeFacePoseNet are especially evident when using templates of five images, in which case its EER remains at or below ~2% even with the lighter +DWC and +DWC+GDC variants (see Figure 4). It should be considered, though, that the images of any user are mostly captured at different moments and contain very diverse variability, so the model generated when combining several of them is probably richer than if the images were taken consecutively (e.g. by recording a video). In this sense, it could be expected that the improvement would not be so high if, for example, we combined several shots taken consecutively, although confirming this would need extra experiments.
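For reference, the difference between GAP and the GDC alternative discussed above can be shown in a short sketch (our illustration of the idea from [9], using the fire9 output size of Table II): GDC is a depth-wise convolution whose kernel covers the entire final feature map, so each spatial position of each channel receives its own learned weight, which presumes spatially aligned faces.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 512, 13, 13)  # fire9 output size from Table II

# GAP: all spatial positions contribute equally
gap = nn.AdaptiveAvgPool2d(1)

# GDC: 13x13 depth-wise convolution (groups = channels); every spatial
# position gets its own weight per channel
gdc = nn.Conv2d(512, 512, kernel_size=13, groups=512, bias=False)

print(gap(feat).shape, gdc(feat).shape)  # both: torch.Size([1, 512, 1, 1])
```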
B. Cross-Pose Comparisons
We now carry out cross-pose verification experiments. In this case, pair-wise comparisons are done between templates generated with images of different poses (Figure 3, bottom). We follow the same protocol for genuine and impostor score generation as in Section V-A, resulting in the amounts indicated in Table III (right). Face verification results of the cross-pose experiments are given in Figures 6 and 7. Also, Table IV, right, shows the EER values of the cross-pose experiments. In a similar vein as Section V-A, SqueezeFacePoseNet works better in general without Global Depth-wise Convolution (GDC), and a slight decrease of performance is seen when using depth-wise separable convolutions (DWC). We can see as well that SENet50ft stands out. With SqueezeFacePoseNet, results with one-image templates are up to one order of magnitude worse than with templates of five images.
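All results are given as Equal Error Rate (EER), the operating point where the false acceptance rate equals the false rejection rate. A simple estimate from the score lists produced by the protocol above (our sketch; it assumes lower score = more similar, as with the χ² distance of Section IV):

```python
import numpy as np

def eer(genuine_scores, impostor_scores):
    """EER from NumPy arrays of genuine/impostor distances."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    frr = np.array([(genuine_scores > t).mean() for t in thresholds])  # rejected genuines
    far = np.array([(impostor_scores <= t).mean() for t in thresholds])  # accepted impostors
    i = np.argmin(np.abs(far - frr))  # closest crossing point
    return (far[i] + frr[i]) / 2
```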
C. Effect of Training Database

We now investigate the effect of the training set on our mobile architecture (Table V), with all networks started from ImageNet pre-training, and trained for biometric identification as described in Section IV. In case only one database is used for training, it can be seen that better results are obtained if the model is trained on a database with more samples per user (VGGFace2), rather than on a database with more samples and more users overall but with fewer samples per user (MS1M). But the biggest benefit in most cases comes when the model is trained first on MS1M, and then fine-tuned on VGGFace2 (row 'both'). This is in line with the results reported in [12]. The biggest advantage is obtained when only one image is used to generate a user template, with improvements of up to 28% in comparison to training on VGGFace2 only. The effect is more diluted when five images are combined to create a user template, especially in cross-pose experiments. In this case, it is slightly better to train only on VGGFace2. However, it is not always the case that such an amount of images is available to generate a user template, e.g. in forensics [20].
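The two-stage training behind the 'both' row could look roughly as follows (our sketch, not the authors' code: the optimizer, learning rates and epoch counts are placeholder assumptions; only the class counts come from the paper, and softmax identification training follows from the conv10/softmax head of Table II):

```python
import torch.nn as nn
import torch.optim as optim

def train_identification(model, loader, num_classes, epochs, lr):
    """Train for biometric identification with a softmax over identities.
    The class head is conv10 in Table II; the attribute name is a placeholder."""
    model.conv10 = nn.Conv2d(512, num_classes, kernel_size=1)  # new class head
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()

# Stage 1: from ImageNet weights, train on MS1M (~35K identities):
#   train_identification(model, ms1m_loader, num_classes=35000, epochs=E1, lr=0.01)
# Stage 2: fine-tune on VGGFace2 (8631 training classes) at a lower rate:
#   train_identification(model, vggface2_loader, num_classes=8631, epochs=E2, lr=0.001)
```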
VI. CONCLUSION
We are interested in the development of a lightweight deep network architecture capable of providing accurate cross-pose face recognition under the restrictions of mobile platforms. For this purpose, we have adapted a very light model of only 4.41MB [8] to operate with small face images of 113×113.

Training    Same-Pose                                  Cross-Pose
Data        F-F           3/4-3/4       P-P            F-3/4         3/4-P         F-P
MS1M        16.82%        16.23%        20.24%         17.45%        21.24%        24.19%
VGGFace2    8.93%         6.97%         8.34%          8.35%         8.16%         10.35%
both        6.39% (-28%)  5.47% (-22%)  7.88% (-6%)    6.09% (-27%)  7.02% (-14%)  8.15% (-21%)

(a) Template consisting of one face per user.

Training    Same-Pose                                  Cross-Pose
Data        F-F           3/4-3/4       P-P            F-3/4         3/4-P         F-P
MS1M        1.17%         2.17%         3.25%          1.70%         5.24%         7.01%
VGGFace2    …             …             …              …             …             …
both        0.27%         0.06%         0.54%          0.20%         0.88%         1.23%

(b) Template consisting of five faces per user.
TABLE V: Effect of the training database in SqueezeFacePoseNet (EER). F=Frontal View. 3/4=Three-Quarter. P=Profile. The best result of each column is marked in bold. Performance variation of the 'both' row w.r.t. the 'VGGFace2' row is given in brackets.

To train the network, we use the VGGFace2 database [12], which is designed to provide large intra-class diversity in comparison to other databases. MS-Celeb-1M contains a larger number of images (3.16M in our experiments), but a larger number of identities as well (35K), so its number of images per identity is smaller. Following recommendations [12], we combine a large database (MS-Celeb-1M) and a database with more intra-class diversity (VGGFace2) to train the recognition network. This is shown to provide increased performance in comparison to using only one of them (Table V).

To achieve further reductions in the size of our model, we test the replacement of standard convolutions with depth-wise separable convolutions [5], leading to a network of just 2.5MB. We also test Global Depth-wise Convolution (GDC) in substitution of the standard Global Average Pooling (GAP) at the end of the network, since some works report that it provides better face recognition performance [9], [10]. The employed architecture is benchmarked against two state-of-the-art architectures [12] having a size >30 times bigger and >20 times more parameters (Table I). We evaluate two biometric verification scenarios, consisting of using a different number of face images to generate a user template. In one case, a template consists of a combination of five face images with the same pose, following the evaluation protocol of [12]. In the second case, we consider the much more difficult case of employing only one image to generate a user template. Different combinations of poses between enrolment and query templates are tested (Table III).

Obviously, the use of five face images to create a user template provides much better performance, with improvements of up to two orders of magnitude in some cases. Also, in our experiments, we have not observed better performance by using Global Depth-wise Convolution, but the opposite. We speculate that this may be because training images of the VGGFace2 database are obtained by randomly cropping the face bounding box, so faces are not spatially aligned (Figure 2b). In this sense, trying to learn different weights for each spatial region may be counterproductive. In addition, as expected [5], the use of depth-wise separable convolution results in a slight decrease of performance.

Even if our light architecture does not outperform the state-of-the-art networks, it obtains meritorious results even under severe pose variations between enrolment and query templates. For example, the comparison of frontal vs. profile images gives an EER of 1.23%. Also, the comparison of profile vs. profile images gives an EER of 0.54%, even if just half of the face is visible in this case. These results are with a template of five face images, which is revealed as a very effective way to improve cross-pose recognition performance. With only one face image per template, the performance of our network goes up to 8.15% and 7.88%, respectively, in the two mentioned cases. In less extreme cases of pose variability, the performance of our network is even better, for example: 0.88% (three-quarter vs. profile view), 0.2% (frontal vs. three-quarter), or 0.27% (frontal vs. frontal).

A number of combinations to create enrolment and query templates would be of interest, and will be the source of future work. For example, if video is available, a collection of frames could be combined for user template generation, probably selecting those with near-frontal pose as well. How many images per template are necessary to obtain accurate performance is also worth studying. In some scenarios like forensics [20], query data may consist of only one image with an arbitrary pose, but several images per suspect may be available in the enrolment database. Therefore, one-query vs. multiple-enrolment images is also of interest to evaluate. Also, in our protocol, a template is generated using only images of the same pose. Combining images of multiple poses in the same template could be a way to create a richer user model, further improving performance.

To improve the performance of our mobile model, we are also looking into the incorporation of residual connections [4] and pre-activation of convolutional layers inside residual blocks [21]. Given the current context where face engines are forced to work with images of people wearing masks, we are also evaluating the accuracy of our model when using partial images containing only the ocular regions [22].

ACKNOWLEDGMENT
This work was partly done while F. A.-F. was a visiting researcher at Facephi Biometria, funded by Sweden's Innovation Agency (Vinnova) under the staff exchange and AI program. Authors F. A.-F., K. H.-D. and J. B. would also like to thank the Swedish Research Council for funding their research. Part of the computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at NSC Linköping.

REFERENCES
[1] Z. Akhtar, A. Hadid, M. S. Nixon, M. Tistarelli, J. Dugelay, and S. Marcel, "Biometrics: In search of identity and security (Q & A)," IEEE MultiMedia, vol. 25, no. 3, pp. 22-35, 2018.
[2] A. Jain, K. Nandakumar, and A. Ross, "50 years of biometric research: Accomplishments, challenges, and opportunities," Pattern Recognition Letters, vol. 79, pp. 80-105, Aug 2016.
[3] K. Sundararajan and D. L. Woodard, "Deep learning for biometrics: A survey," ACM Comput. Surv., vol. 51, no. 3, May 2018. [Online]. Available: https://doi.org/10.1145/3190618
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770-778.
[5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520.
[7] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848-6856.
[8] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[9] S. Chen, Y. Liu, X. Gao, and Z. Han, "MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices," CoRR, vol. abs/1804.07573, 2018. [Online]. Available: http://arxiv.org/abs/1804.07573
[10] C. N. Duong, K. G. Quach, I. K. Jalata, N. Le, and K. Luu, "MobiFace: A lightweight deep learning face recognition on mobile devices," in Proc. IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Sep. 2019.
[11] Y. Martinez-Diaz, L. S. Luevano, H. Mendez-Vazquez, M. Nicolas-Diaz, L. Chang, and M. Gonzalez-Mendoza, "ShuffleFaceNet: A lightweight face architecture for efficient and highly-accurate face recognition," in Proc. IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 2721-2728.
[12] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2018, pp. 67-74.
[13] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884-2896, 2018.
[14] J. Zhang, "SeesawFaceNets: Sparse and robust face verification model for mobile platform," 2019.
[15] J. Zhang, "Seesaw-Net: Convolution neural network with uneven group convolution," CoRR, vol. abs/1905.03672, 2019. [Online]. Available: http://arxiv.org/abs/1905.03672
[16] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 87-102.
[18] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-stage dense face localisation in the wild," CoRR, vol. abs/1905.00641, 2019. [Online]. Available: http://arxiv.org/abs/1905.00641
[19] S. Kornblith, J. Shlens, and Q. V. Le, "Do better ImageNet models transfer better?" in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2656-2666.
[20] A. K. Jain and A. Ross, "Bridging the gap: From biometrics to forensics," Phil. Trans. R. Soc., vol. 370, 2015.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 630-645.
[22] F. Alonso-Fernandez and J. Bigun, "A survey on periocular biometrics research," Pattern Recognition Letters, vol. 82, pp. 92-105, 2016.