An Experimental Evaluation of Covariates Effects on Unconstrained Face Verification
Boyu Lu, Student Member, IEEE, Jun-Cheng Chen, Member, IEEE, Carlos D. Castillo, Member, IEEE, and Rama Chellappa, Fellow, IEEE

B. Lu is with the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA (e-mail: [email protected]). J.-C. Chen is with the Center for Automation Research, University of Maryland, College Park, MD 20742, USA (e-mail: [email protected]). C. D. Castillo is with the Center for Automation Research, University of Maryland, College Park, MD 20742, USA (e-mail: [email protected]). R. Chellappa is with the Department of Electrical and Computer Engineering and the Center for Automation Research, University of Maryland, College Park, MD 20742, USA (e-mail: [email protected]).
Abstract—Covariates are factors that have a debilitating influence on face verification performance. In this paper, we comprehensively study two covariate-related problems for unconstrained face verification: first, how covariates affect the performance of deep neural networks on the large-scale unconstrained face verification problem; second, how to utilize covariates to improve verification performance. To study the first problem, we implement five state-of-the-art deep convolutional networks (DCNNs) for face verification and evaluate them on three challenging covariate datasets. In total, seven covariates are considered: pose (yaw and roll), age, facial hair, gender, indoor/outdoor, occlusion (nose and mouth visibility, eyes visibility, and forehead visibility), and skin tone. These covariates cover both intrinsic subject-specific characteristics and extrinsic factors of faces. Some of the results confirm and extend the findings of previous studies, while others are new findings that were rarely mentioned previously or did not show consistent trends. For the second problem, we demonstrate that with the assistance of gender information, the quality of a pre-curated noisy large-scale face dataset for face recognition can be further improved. After retraining the face recognition model using the curated data, performance improvement is observed at low False Acceptance Rates (FARs) (FAR $= 10^{-7}, 10^{-6}, 10^{-5}$).

Index Terms—Covariates, Deep convolutional neural networks, Unconstrained face verification.
1 INTRODUCTION
Face verification has been receiving consistent attention in the computer vision community for over two decades [52]. The task of face verification is to verify whether a given pair of face images/templates belongs to the same subject. Recently, due to the rapid development of deep convolutional neural networks (DCNNs), face verification performance has surpassed human performance in most controlled situations and some unconstrained cases [10], [42], [46], [49]. Although deep features have proven to be more robust to moderate variations in pose, aging, occlusion and other factors than hand-crafted features, some recent works have noticed that face verification performance is still significantly affected by many covariates [12], [35], [39], [47].

Covariates are factors that usually have an undesirable influence on face verification performance (e.g., gender naturally induces different facial appearance characteristics). Some covariates represent different aspects of faces, such as pose, expression and age; some represent subject-specific intrinsic characteristics, like gender, race and skin tone; and others reflect extrinsic factors in images, such as illumination, occlusion and resolution. Analyzing the effects of these covariates can not only help understand fundamental problems in face verification, but also provide insights to improve existing face verification algorithms.

Previous studies have analyzed the effects of many covariates on face recognition performance [1], [7], [30], [34].
However, most of them are outdated, and there are several reasons why a new study on these covariates is needed. First, most studies were conducted before the emergence of deep networks. Since deep networks have significantly improved the robustness of features against many covariates, it is unclear whether the conclusions about covariate effects drawn from hand-crafted features are still valid when deep features are used. Second, most datasets studied in previous works are small (e.g., 41,368 images from 68 people in the CMU PIE dataset [44]) and the class distribution of some covariates is severely imbalanced. In this situation, some conclusions may become statistically biased. Moreover, due to the absence of large data, very few experiments have studied covariate effects at extremely low FARs ($10^{-5}$, $10^{-6}$). Third, the face images in former studies were captured in constrained environments (e.g., the CMU PIE dataset [44]), which is less applicable in practice. Last but not least, most existing papers only focus on whether some covariate values have advantages over others (e.g., whether a male is easier to recognize than a female), but few of them try to exploit covariate information to improve face verification performance. In fact, some covariates (e.g., gender, race) contain subject-specific information of faces and are more robust to many extrinsic variations than low-level features. Properly exploiting them could significantly improve face verification performance [24].

In this paper, we investigate two important problems: a) how different covariates affect the performance of state-of-the-art DCNNs for unconstrained face verification; b) how to utilize covariate information to improve face verification performance. For the first problem, we implement five state-of-the-art face DCNNs and evaluate them on three challenging covariate protocols: the 1:1 covariates protocol of the IARPA JANUS Benchmark B (IJB-B) dataset [51], its extended version in the IARPA JANUS Benchmark C (IJB-C) dataset [32], and the Celebrities in Frontal-Profile (CFP) dataset [43]. We report the performance of each individual network and the performance of the score-level fusion method. We also compare the results with some other well-known and publicly available deep face networks. Among these datasets, the IJB-C 1:1 covariate protocol is currently the largest public covariate dataset for unconstrained face verification. The protocol contains seven covariates covering different covariate types. Moreover, the IJB-C dataset is designed to have a more uniform geographic distribution of subjects across the globe, which makes it possible to carefully evaluate many covariates (e.g., age and skin tone) in detail.

By conducting extensive experiments on the IJB-B and IJB-C datasets, we observe many interesting results for different covariates.
Some of our findings support conclusions drawn from previous studies. For example, extreme yaw angles do substantially degrade the performance [43], and outdoor images are harder to recognize [29]. Meanwhile, we also find some results which extend the findings of previous works due to the availability of larger datasets. For instance, most previous studies show that face recognition algorithms usually achieve better performance on older subjects than on younger subjects [7], [30]. But in those studies, most of the enrolled subjects were under 40 years old. Our experiments, with many more subjects spanning a wider age range, show that the performance does not increase monotonically as age progresses. The performance increases from the youngest age group [0, 19] to the middle age group [35, 49], but begins to drop for the older age groups [50, 64] and 65+. The results demonstrate that neither very young nor very old people are easy to recognize, and the recognition results for very young people (i.e., [0, 19]) are the worst. Moreover, we are able to better evaluate some covariates, like gender, where previous works came to contradictory conclusions [30]. Our experiments show that males are easier to verify than females in general. However, when we combine gender with other covariates (age, skin tone) to investigate their mixed effects, we find that the face verification performance for females becomes better than that for males for older age groups and darker skin tones. Finally, some of our results are surprising yet rarely analyzed in other papers. One example is that roll variations greatly affect verification performance in unconstrained situations. Since most previous studies may have used manually aligned faces, roll variation was not a significant factor in their studies. However, in unconstrained environments, face alignment becomes a key component, and our finding suggests that the performance variations result from face alignment algorithms failing to work perfectly for faces at extreme roll angles.

For the second problem, we utilize gender information to curate a noisy large-scale face dataset. Specifically, we find that the curated MS-Celeb-1M [21], [27] still contains many noisy labels, where some subjects are mixed with images from different genders. Training on data with these noisy labels may hurt the discriminative capability of deep models and degrade their performance, especially in low FAR regions ($10^{-5}$, $10^{-6}$, etc.). Therefore, we leverage gender information to further curate the training set and remove those subjects mixed with images of both males and females. First, we predict a gender probability for each image in the training set using the multi-task face network proposed in [40]. Since gender prediction may become inaccurate when gender probabilities are near 0.5, we only consider faces with gender probability greater than 0.6 (male) or smaller than 0.4 (female). Then, for each subject, if the percentage of faces from the minority gender exceeds a threshold (3% in our experiments), we remove all the face images of that subject from the training set. After retraining the model using the curated data, the performance improves at low FARs.

The main contributions of this paper are summarized as follows:
• We comprehensively study the effects of seven covariates on the performance of unconstrained face verification. The datasets we use are the largest public covariate face datasets, which allows evaluation at very low FARs ($10^{-5}$, $10^{-6}$). We test all the covariates using state-of-the-art deep models.
This gives insight into the limitations of many existing deep CNNs with respect to face covariates.
• We study the mixed effects of multiple covariates. This is an important problem for unconstrained environments that has not been deeply explored by previous studies.
• We propose to utilize gender information to effectively curate the training data and achieve better performance.

The rest of the paper is organized as follows. A brief review of the literature on covariate analysis for face verification is presented in Section 2. In Section 3, we introduce five state-of-the-art DCNNs for face verification and the way we fuse their similarity scores. A method for utilizing the gender covariate for training set curation is presented in Section 4. Experimental results on three different covariate datasets are shown in Section 5, and we summarize and conclude the paper in Section 6.
2 RELATED WORKS
Several prior works discussed the effects of covariates on face recognition performance [1], [7], [8], [15], [16], [18], [30]. Gross et al. [18] evaluated two algorithms on three face datasets and discussed five covariates: pose, illumination, expression, occlusion and gender. They varied each covariate with the other factors fixed and examined the performance changes of the two algorithms. Similarly, Beveridge et al. [7], [8] applied a statistical approach called the Generalized Linear Mixed Model (GLMM) to analyze two types of covariates: subject covariates (e.g., gender, race, wearing glasses) and image covariates (e.g., image size ratio, the number of pixels between the eyes). Three algorithms were tested, and they claimed that the effects of covariates varied significantly across algorithms. In [15], Givens et al. split faces into three groups (good, bad and ugly) based on their verification rates. They used GLMM to analyze the underlying effects of different covariates over these three groups. They showed that many covariate effects on verification performance are universal across the three groups. Different from the previous works that use statistical methods to analyze covariates, Lui et al. [30] presented a meta-analysis of six covariates on face recognition performance by summarizing and comparing different papers. In order to guarantee that the conclusions are meaningful, they restricted their analysis to frontal, still, visible-light images. In [1], Abdurrahim et al. reviewed recent research on demographics-related covariates (age, race, and gender). They drew similar conclusions as [30] for most covariates (e.g., age, gender), while also paying attention to interactions among demographic covariates. In [16], Grm et al. analyzed the effects of covariates related to image quality (like blur, occlusion, brightness) and model characteristics (like color information). They used the Labeled Faces in the Wild (LFW) [25] dataset to synthesize degraded images and compared the robustness of four widely used DCNNs to each covariate. In the following subsections, we briefly review the main findings of related works for each specific covariate.
Fig. 1. System pipeline for unconstrained face verification.
Studies on the effects of pose variations on face recognition have been reported in [11], [17], [26], [46]. Pose variations generally involve yaw, roll and pitch. Among them, yaw and pitch variations are out-of-plane rotations, while roll corresponds to an in-plane rotation. Normally, roll variations can be eliminated by applying face alignment using a similarity or affine transform to warp the face into pre-defined canonical coordinates, while yaw and pitch variations are much harder to rectify and thus have a larger impact on face recognition performance than roll. Recent studies show that even the best deep-learning based face models are still severely affected by large pose variations [26], [46].
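To make the in-plane case concrete, the sketch below removes roll with a similarity transform that maps the two detected eye centers onto canonical positions. This is a generic recipe under stated assumptions, not the alignment code of any system cited above; the canonical eye coordinates, output size, and function name are illustrative.

```python
import numpy as np
import cv2

def roll_align(image, left_eye, right_eye, out_size=128):
    """Warp a face so the eyes land on canonical positions, removing roll.

    left_eye / right_eye are (x, y) pixel coordinates. A 2D similarity
    transform z -> a*z + b is determined exactly by two point pairs, so we
    solve for it with complex arithmetic and hand the result to warpAffine.
    """
    dst = [(0.35 * out_size, 0.40 * out_size),   # canonical left eye (assumed)
           (0.65 * out_size, 0.40 * out_size)]   # canonical right eye (assumed)
    z1, z2 = complex(*left_eye), complex(*right_eye)
    w1, w2 = complex(*dst[0]), complex(*dst[1])
    a = (w2 - w1) / (z2 - z1)                    # rotation + uniform scale
    b = w1 - a * z1                              # translation
    M = np.float32([[a.real, -a.imag, b.real],
                    [a.imag,  a.real, b.imag]])
    return cv2.warpAffine(image, M, (out_size, out_size))
```

Because yaw and pitch move the face out of the image plane, no such closed-form 2D warp exists for them, which is why they dominate the pose effects discussed throughout this paper.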
The effects of age on face verification performance are usually studied in two ways: aging and age groups. Aging effects are best analyzed in the cross-age face verification scenario, because it tries to recognize faces of the same subject at different ages. This is a challenging problem because, for most subjects, facial appearance changes tremendously as they become older [6], [28], [38]. In contrast, age group effects refer to the difficulty of recognizing people from different age groups. This line of study aims to explore whether a certain age group is harder to recognize than other groups [6], [8], [15], [30].
It has been revealed by almost all studies that age variations for the same person impair verification performance. However, the effects may not be significant if the age differences are within several months [20]. Although aging effects become substantial if the acquisition time difference exceeds several years, there are still some features preserved on faces that can be utilized for face verification [6]. Some works tried to reduce intra-subject aging effects for face verification through discriminative learning or feature selection [12], [38]. Best-Rowden et al. applied mixed-effects models to analyze aging effects using a large mugshot dataset. They showed that the average similarity score of genuine pairs decreases significantly with increasing elapsed time between a gallery and a probe. However, they found that, on average, the genuine pairs can still be recognized at FAR = 0.01% when the elapsed time is no more than 15 years.
The effects of age groups have been discussed in many studies. Interestingly, unlike many other covariates where different studies show different results, most studies have come to similar conclusions on age group effects: older subjects are usually easier to recognize than younger subjects [1], [7], [8], [30]. However, most of the experiments were conducted in settings where age distributions are very imbalanced and the number of samples for young people is much larger than for old people. The imbalance increases the difficulty of verification for young people. In [23], Ho et al. conducted experiments with each age group evenly distributed. They found that the performance for young ages and old ages did not show a statistically significant difference.
Gender is one of the intrinsic characteristics of a human face. A man's face differs from a woman's face in terms of shape, facial part distances and facial hair. However, studies on the effects of gender on verification performance have led to different conclusions. Lui et al. [30] summarized covariate research papers from 2001 to 2008. Seven studies found men were easier to recognize [7], [8], while five claimed women were easier [7], [8], [9], and six reported that gender shows no effect on face recognition performance [7], [8], [13], [14]. More recently, Grother et al. [19] evaluated seven commercial algorithms, and five of them were more accurate on males. On the other hand, gender has also been shown to correlate with other covariates like age [30]. In [37], Phillips et al. reported that the performance difference between males and females decreases as people age.
Race and skin tone are demographic covariates that represent subject-specific characteristics of people. There have been several studies on the effects of race and skin tone on face verification performance, but few of them can be clearly interpreted [1], [30]. This is mainly due to the fact that most datasets are very biased
with respect to race distribution. In [30], all the datasets studied contain more Caucasians than East Asians, with a ratio of 3 to 1. Therefore, even though East Asians outperform Caucasians in all the cases in [30], it is still hard to conclude that East Asians are easier to verify. In another paper [19], Grother et al. found that the influence of race on performance is conflicting across algorithms. African Americans are more easily recognized than Caucasians for five out of six algorithms. American Indians and Asians are easier to recognize for three algorithms but are more difficult for one algorithm. These results may simply be due to different training processes, where algorithms are superior for some races over others. There is also one paper studying the influence of skin tone on face verification [5]. Bar-Haim et al. reported that the effects of skin tone on verification performance are not as important as other unique facial features for certain races.
Occlusion can be caused by wearing glasses/sunglasses, masks or scarves, or by hairstyle (like bangs). It has been widely reported that occlusion of key facial parts can substantially degrade verification performance [7], [8], [16], [47]. However, different algorithms are not sensitive to occlusion to the same degree [8], [16], [47]. There is also one study reporting that consistently wearing glasses may help improve verification performance for faces acquired outdoors [8].
The effects of indoor/outdoor settings are related to other image covariates like illumination, resolution, and blur. Most studies revealed that indoor performance is generally better than outdoor [8], [9], [29], [30]. Moreover, the indoor/outdoor effect has also been found to correlate with other covariates. In [8], [9], Beveridge et al. reported that recognition performance under outdoor environments often favors high resolution images, while for low resolution images, indoor environments are preferred. Another finding reported in [8] is that the indoor/outdoor taxonomy also affects verification performance differently for different genders and sometimes may even reverse the trends.
Studies on facial hair effects are limited compared to other common covariates. Earlier studies [13], [14] suggested that performance is better when facial hair exists in at least one of the images. However, the underlying reason for this result is unclear, because facial hair is not a unique biometric for recognition and can be changed easily.
3 EVALUATION PIPELINE OVERVIEW
Before addressing the first problem, we briefly introduce the five deep networks that we use to perform unconstrained face verification over covariates. Before feeding a face image into these networks, preprocessing steps including face detection, facial landmark detection and face alignment are performed using the multi-task CNN framework proposed in [40]. More details about the multi-task CNN are provided in Section 3.1.5. After feature extraction, we apply triplet probabilistic embedding (TPE) [41] on the deep features to further improve face verification performance. TPE learns a projection matrix W by minimizing a negative log-likelihood objective function. The idea of TPE is to push positive pairs closer and negative pairs farther apart by selecting anchor and positive/negative samples. More details can be found in [41]. The end-to-end system pipeline is illustrated in Figure 1.

To capture the different characteristics of faces, we use features extracted from five state-of-the-art deep neural networks. These five networks have different architectures and training sets, each with its own strengths and weaknesses. Details of each network architecture are presented next.
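Before turning to the individual architectures, the following is a minimal sketch of the TPE objective for a single triplet, written from our reading of [41]: the probability that the anchor is more similar to the positive than to the negative is modeled with a sigmoid, and the negative log-likelihood is minimized over W. The function name and the plain gradient step are assumptions for illustration.

```python
import numpy as np

def tpe_step(W, anchor, pos, neg, lr=0.01):
    """One gradient step of triplet probabilistic embedding on W.

    anchor/pos/neg are D-dim deep features; W is (d, D). The triplet
    margin m is the difference of embedded similarities, and the loss is
    -log(sigmoid(m)), following the probabilistic model of [41].
    """
    a, p, n = W @ anchor, W @ pos, W @ neg
    m = a @ p - a @ n                   # similarity margin in embedding space
    prob = 1.0 / (1.0 + np.exp(-m))     # P(triplet is satisfied)
    loss = -np.log(prob + 1e-12)
    # d(-log sigmoid(m))/dm = prob - 1; dm/dW follows from the chain rule.
    grad = (prob - 1.0) * (np.outer(a, pos - neg) + np.outer(p - n, anchor))
    return W - lr * grad, loss
```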
To train the deep networks, we use UMDFaces [3], [4], Megaface [33], and MS-Celeb-1M [21]. In addition, we found that directly using the original MS-Celeb-1M dataset for training does not achieve good performance, because the labels are very noisy. Therefore, we used a curated version of the MS-Celeb-1M dataset, obtained with the clustering method introduced in [27]. The curated dataset contains about 3.7 million face images from 57,440 identities. After curation, many noisy labels are removed while a sufficient number of face images with different variations is retained.
This network employs the ResNet-27 model introduced in [50]. We modify the original model by removing the center loss and replacing the softmax loss with the $L_2$-softmax loss introduced in [39]. The $L_2$-softmax loss constrains the learned features to lie on a hypersphere with a fixed radius α before they are fed into the softmax classifier. Since the norm of the features for hard samples is usually smaller than that for easy samples when applying the softmax loss [39], enforcing the $L_2$ constraint ensures that training focuses more on hard samples and significantly improves verification performance [39]. In addition, we add one more 512-D fully connected layer before the $L_2$-softmax layer to reduce the feature dimension and the total number of model parameters. We also change the original input size for improved face alignment. To train the model, we use the curated version of the MS-Celeb-1M dataset described in Section 3.1.1, which contains 3.7 million images from 57,440 subjects.

The second network uses the ResNet-101 [22] architecture as the base network. CNN-2 is deeper than CNN-1 and has a larger input size. The basic blocks of CNN-2 use bottleneck structures to reduce the number of model parameters and achieve deeper networks under given memory constraints. Similar to CNN-1, CNN-2 also replaces the original softmax loss with the $L_2$-softmax loss and adds an extra fully connected layer before the $L_2$-softmax layer. CNN-2 is trained on two different training sets, yielding two models. One model is called CNN-2_S because a small training set is used (the curated MS-Celeb-1M dataset), and the other is called CNN-2_L because it uses a larger training set (the curated MS-Celeb-1M dataset plus Megaface).
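A minimal PyTorch sketch of the $L_2$-softmax head shared by CNN-1 and CNN-2: features are projected onto a hypersphere of radius α before the classifier. The 512-D input and 57,440-way output match the curated training set above, while the value α = 40 is an assumed placeholder, not a setting taken from this paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class L2SoftmaxHead(nn.Module):
    """L2-constrained softmax [39]: scale features to a fixed norm alpha,
    which keeps hard (small-norm) samples from being down-weighted."""

    def __init__(self, feat_dim=512, num_classes=57440, alpha=40.0):
        super().__init__()
        self.alpha = alpha                   # hypersphere radius (assumed value)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, features, labels):
        features = self.alpha * F.normalize(features, p=2, dim=1)
        return F.cross_entropy(self.fc(features), labels)
```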
Fig. 2. Examples of hard negative pairs with low detection confidence but high similarity scores. ds indicates the detection score of each image and S represents the similarity score of each pair.

The Inception-ResNet-v2 [48] model is used as the base network. This model combines the Inception architecture with residual connections and achieved state-of-the-art performance on the ImageNet classification challenge. In addition, scaling layers are included in the network architecture to scale down the residuals for more stable training. We adapt the Inception-ResNet-v2 model by adding a 512-D fully connected layer before the last layer. The training set contains over six million images from about 58,000 subjects. These images are a mixture of about 3.7 million still images from the curated MS-Celeb-1M dataset of Section 3.1.1, about 300,000 still images from the UMDFaces dataset [4], and about 1.8 million video frames from the extension to the UMDFaces video dataset [3].
This network is based on the all-in-one CNN architecture [40]. The model is trained in a multi-task learning framework, which utilizes the correlations among different tasks to learn a more robust model than learning each task individually. Lower layers of the network are shared across all the tasks to produce a generic representation, while intermediate layers are shared only among more related tasks. Each task also has its own task-specific layers and losses. In this paper, we mainly utilize the face detection and facial landmark detection branches for face alignment, and the face recognition branch to generate face features. We also use the gender classification branch to estimate gender probabilities. The face detection and facial landmark detection branches share the first six layers and have two separate fully connected layers for each task. The face recognition branch consists of seven convolutional layers followed by three fully connected layers. The same training set is used as for CNN-1 and CNN-2_S.
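Schematically, such a network is a shared trunk feeding task-specific heads. The toy sketch below only illustrates that layout; the layer sizes, the landmark count, and all names are placeholders rather than the actual all-in-one configuration of [40].

```python
import torch
import torch.nn as nn

class MultiTaskFaceSketch(nn.Module):
    """Toy multi-task layout: shared conv trunk, then one head each for
    detection confidence, landmarks, identity features and gender."""

    def __init__(self, feat_dim=512, num_landmarks=68):
        super().__init__()
        self.trunk = nn.Sequential(               # shared generic representation
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.detect = nn.Linear(64, 1)            # face detection score
        self.landmarks = nn.Linear(64, 2 * num_landmarks)
        self.identity = nn.Linear(64, feat_dim)   # recognition features
        self.gender = nn.Linear(64, 1)            # P(male) after sigmoid

    def forward(self, x):
        h = self.trunk(x)
        return (torch.sigmoid(self.detect(h)), self.landmarks(h),
                self.identity(h), torch.sigmoid(self.gender(h)))
```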
After we obtain the extracted features from the learned deep networks and the embedding matrix W from TPE [41], the similarity score for each pair $\{x_i, x_j\}$ is computed simply as the cosine similarity of the two embedded features:

$$s_{ij} = \frac{(W x_i)^T (W x_j)}{\|W x_i\| \, \|W x_j\|} \qquad (1)$$

In the last stage of the proposed system, we fuse the scores computed from the five networks into the final similarity score. We observe that the similarity scores may become unreliable when image quality is poor. Meanwhile, we find that the face detection scores obtained from the face detection branch of CNN-4 are a good indication of image quality. More specifically, low detection scores usually indicate low quality of the detected faces (low resolution, extreme pose or severe blur), from which the deep models fail to extract useful facial features. Figure 2 shows some hard negative pairs with low detection scores but high similarity scores. We notice that the main reason for the high similarity scores is that these pairs are all very blurred and each pair has a similar background. To address this issue, we reweight the similarity scores when the face detection scores of the corresponding pairs are low:

$$\hat{s}_i = \begin{cases} s_i, & \text{if } ds > thr \\ \alpha s_i, & \text{otherwise,} \end{cases} \qquad (2)$$

where ds is the minimum of the detection scores of the pair of faces, thr is the threshold, and α is the reweighting coefficient. We then simply average the reweighted similarity scores from the five networks to get the final result:

$$s = \frac{1}{5} \sum_{i=1}^{5} \hat{s}_i \qquad (3)$$
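Equations (1)-(3) translate directly into code. The sketch below uses the thr = 0.75 and α = 0.8 settings reported in Section 5.2; the function names are ours.

```python
import numpy as np

def pair_similarity(W, xi, xj):
    """Eq. (1): cosine similarity of TPE-embedded features."""
    ei, ej = W @ xi, W @ xj
    return ei @ ej / (np.linalg.norm(ei) * np.linalg.norm(ej))

def fused_score(scores, ds, thr=0.75, alpha=0.8):
    """Eqs. (2)-(3): shrink each network's score by alpha when the lower
    of the pair's two detection scores (ds) is below thr, then average."""
    reweighted = [s if ds > thr else alpha * s for s in scores]  # Eq. (2)
    return sum(reweighted) / len(reweighted)                     # Eq. (3)
```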
Fig. 3. Instances of subjects with noisy labels.

4 PERFORMANCE IMPROVEMENT BY EXPLOITING GENDER INFORMATION

Although many noisy labels are removed after curating the training set using the clustering method mentioned in Section 3.1.1, there still exist many noisy labels which cannot be handled by clustering. Figure 3 shows an example of a subject which still contains noisy labels. We can see that the mislabeled face images look very similar to the correctly-labeled faces. Since the clustering method mainly determines a cluster based on the appearance similarity between faces, it is very hard to discover these mislabeled images.

However, we observe that some mislabeled images have different genders compared to the correctly-labeled images. This motivates us to further curate the training set by exploiting gender information. First, gender probabilities are estimated using the all-in-one face network [40] for all the face images in the pre-curated MS-Celeb-1M dataset of Section 3.1.1. Since gender estimation may become unreliable when gender probabilities are near 0.5, we only consider faces with gender probability greater than 0.6 (male) or smaller than 0.4 (female). For each subject, if the number of faces from the minority gender accounts for more than 3% of the total number of faces, we eliminate the whole subject. In total, we removed 248,059 faces from 4,160 subjects.
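A sketch of this rule, assuming the input is an iterable of (subject_id, p_male) pairs with p_male taken from the gender branch; reading the 3% minority fraction as a fraction of all of a subject's faces is our interpretation of the text.

```python
from collections import defaultdict

def curate_by_gender(samples, lo=0.4, hi=0.6, max_minority=0.03):
    """Return the subject ids to keep after gender-based curation.

    Faces with lo < p_male < hi are treated as ambiguous: they count
    toward a subject's total but toward neither gender.
    """
    stats = defaultdict(lambda: [0, 0, 0])     # per subject: [female, male, total]
    for sid, p_male in samples:
        stats[sid][2] += 1
        if p_male >= hi:
            stats[sid][1] += 1                 # confident male
        elif p_male <= lo:
            stats[sid][0] += 1                 # confident female
    return {sid for sid, (f, m, total) in stats.items()
            if min(f, m) / total <= max_minority}  # minority gender <= 3%
```

Subjects whose minority-gender fraction exceeds the threshold are dropped wholesale, trading a small amount of training data (248,059 faces here) for cleaner identity labels.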
Fig. 4. Sample images for IJB-B and IJB-C datasets.
5 EXPERIMENTAL RESULTS
To analyze the covariate effects on unconstrained face verification performance, we evaluated the five deep networks on three challenging face datasets that have face verification covariate protocols: the IARPA JANUS Benchmark B (IJB-B) 1:1 covariates [51], the IARPA JANUS Benchmark C (IJB-C) 1:1 covariates [32], and the Celebrities in Frontal-Profile in the Wild (CFP) [43]. The IJB-B and IJB-C 1:1 covariate protocols both contain seven covariates, while the CFP dataset mainly focuses on extreme pose variations.
The IARPA JANUS Benchmark B (IJB-B) dataset [51] is a moderate-scale unconstrained face dataset with face detection, recognition and clustering protocols. It consists of 1,845 subjects with human-labeled ground truth face bounding boxes, eye/nose locations, and covariate metadata such as occlusion, facial hair, and skin tone for 21,798 still images and 55,026 frames from 7,011 videos. The 1:1 covariates protocol of IJB-B aims to analyze the effects of seven different covariates (i.e., pose (yaw and roll), age, facial hair, gender, indoor/outdoor, occlusion (nose and mouth visibility, eyes visibility, and forehead visibility), and skin tone) on face verification performance. The protocol has 20,270,277 pairs of templates (3,867,417 positive and 16,402,860 negative pairs), which enables us to evaluate algorithms in the low-FAR region of the ROC curve (e.g., FAR at 0.001% and 0.0001%). Each template contains only one image or one video frame. Some sample images are shown in the first row of Figure 4. The IARPA JANUS Benchmark C (IJB-C) dataset [32] is an extended version of the IJB-B dataset, and contains more subjects and pairs for evaluation. It consists of 3,531 subjects with 140,739 images and video frames. The 1:1 covariates protocol has 47,404,001 pairs of templates (7,819,362 positive and 39,584,639 negative pairs). Some sample images are shown in the second row of Figure 4.

To understand the effects of different covariates on face verification performance, in addition to the identity label (positive or negative) for each pair of templates, covariate labels are also assigned to each pair. To analyze a certain covariate (like gender), all pairs are split into groups based on the value of the covariate labels (e.g., female = 0 and male = 1). The ROC curve is drawn for each group, and the performance difference among the groups reflects the effects of the covariate. When we evaluate the general performance of an algorithm, all the pairs are mixed together without specifying separate covariate labels. In the following sections, we first present our experimental results on the overall protocol, where covariate labels are not involved, and then delve into the details of each covariate result.
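Throughout this section, performance is summarized as the true accept rate (TAR) at a fixed FAR. A standard way to compute such numbers from raw genuine/impostor score lists is sketched below (this is the usual quantile recipe, not necessarily the benchmark's exact evaluation code); it also shows why the tens of millions of negative pairs in IJB-B/IJB-C matter.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far):
    """TAR at a target FAR: pick the threshold as the (1 - far) quantile
    of the impostor scores, then count genuine pairs that clear it."""
    threshold = np.quantile(np.asarray(impostor_scores), 1.0 - far)
    return float(np.mean(np.asarray(genuine_scores) >= threshold))

# With IJB-C's ~39.6M negative pairs, the FAR = 1e-6 threshold is still
# supported by roughly 40 impostor scores above it; far smaller datasets
# could not measure this operating point reliably.
```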
To compare the performance of the five deep networks, we present the ROC curves for each network and for their score-level fusion. For detection score-based fusion, the threshold thr is set to 0.75 and the reweighting coefficient α is set to 0.8. We also performed a sensitivity analysis on these two parameters in Section 5.2.4. Figures 5(a) and 5(b) show the performance for the IJB-B and IJB-C 1:1 covariates, respectively. From both figures, we observe that CNN-2_S performs very well at high FARs of the ROC curve, but its performance drops rapidly at low FARs. In contrast, CNN-1, CNN-3 and CNN-4 have smoother curves and perform better at low FARs but worse at high FARs than CNN-2_S. Meanwhile, CNN-2_L shows very strong performance across all FARs and outperforms the other four networks in the middle range of FARs. Moreover, the fusion results of the five networks outperform all individual models, especially at low FARs of the ROC curve on the IJB-C dataset. This demonstrates the complementary behavior of the different models: fusion can always yield some improvement over individual models. Comparing the ROC curves of the IJB-B and IJB-C datasets, we see similar trends at higher FARs, but the performance on IJB-B drops faster at low FARs of the ROC curve. In addition, at low FARs, different algorithms perform very differently on IJB-C but similarly on IJB-B. This may be because the IJB-B dataset contains more hard negative pairs that cannot be handled by any of these algorithms.

To test the effectiveness of the dataset curation method discussed in Section 4, we retrained CNN-1 using the training set curated by exploiting gender information and compared with the results obtained before curation. From Table 1, it can be seen that performance improves at low FARs of the ROC curves after training set curation on both the IJB-B and IJB-C datasets. Since the goal of gender-based curation is to improve the model's capability to distinguish male and female subjects who look very similar, performance improvements at low FARs are consistent with this goal, because they indicate that the model deals with hard negative pairs better. On the other hand, we notice that the performance improvements on IJB-C are larger than on IJB-B, which suggests that the gender information is more useful for detecting the hard negative pairs in IJB-C than in IJB-B.
We compare our fusion results with some other competitive methods; the performance for IJB-B and IJB-C is shown in Table 2 and Table 3, respectively. Although there exist many other face networks (e.g., DeepID3 [45], Pose-Aware Face Networks [31]), their models are not publicly available. Therefore, we tested two widely used public models: VGG-Face [36] and Center-Face [50]. More specifically, we used the pretrained models provided by the authors to extract features and followed their preprocessing steps on the face images. Table 2 and Table 3 clearly show that our fusion results outperform both VGG-Face and Center-Face by large margins.
Fig. 5. ROC curves for the IJB-B (a) and IJB-C (b) 1:1 covariates overall protocol without specifying separate covariate labels. The fusion results are obtained by detection-score based fusion of the five CNN networks. The figures are best viewed in color.
TABLE 1. Performance comparison between before and after gender-based training set curation on the IJB-B and IJB-C 1:1 covariate overall protocol. All results are generated using the CNN-1 architecture.

| Method | TAR@FAR=$10^{-7}$ | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ |
| IJB-B before curation | – | – | – | – | – | – | – |
| IJB-B after curation | 0.0245 | – | – | – | – | – | – |
| IJB-C before curation | – | – | – | – | – | – | – |
| IJB-C after curation | – | – | – | – | – | – | – |
TABLE 2. Performance comparison of different methods on the IJB-B 1:1 covariate overall protocol. Our fusion results are generated by detection score-based fusion of the five deep models. VGG-Face and Center-Face results are derived by applying the authors' pretrained models to extract features and following the IJB-B 1:1 covariate overall protocol. Center-Face (retrain) is retrained using the curated MS-Celeb-1M dataset and the Center-Face model.

| Method | TAR@FAR=$10^{-7}$ | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ |
| VGG-Face | 0.0150 | 0.0440 | 0.0994 | 0.1515 | 0.2190 | 0.3318 | 0.5723 |
| Center-Face | 0.0063 | 0.0353 | 0.0780 | 0.1363 | 0.2370 | 0.4206 | 0.7501 |
| Center-Face (retrain) | 0.0517 | 0.1656 | 0.3880 | 0.6014 | 0.7620 | 0.8692 | 0.9460 |
| Fusion of our five models | – | – | – | – | – | – | – |
TABLE 3. Performance comparison of different methods on the IJB-C 1:1 covariate overall protocol. Our fusion results are generated by detection score-based fusion of the five deep models. VGG-Face and Center-Face results are derived by applying the authors' pretrained models to extract features and following the IJB-C 1:1 covariate overall protocol. Center-Face (retrain) is retrained using the curated MS-Celeb-1M dataset and the Center-Face model.

| Method | TAR@FAR=$10^{-7}$ | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ |
| VGG-Face | 0.0513 | 0.0792 | 0.1159 | 0.1616 | 0.2275 | 0.3396 | 0.5918 |
| Center-Face | 0.0479 | 0.0652 | 0.1005 | 0.1629 | 0.2746 | 0.4739 | 0.7733 |
| Center-Face (retrain) | 0.2417 | 0.3596 | 0.5023 | 0.6403 | 0.7660 | 0.8624 | 0.9368 |
| Fusion of our five models | – | – | – | – | – | – | – |

There are two main reasons for this dramatic performance difference. First, we employ deeper models and various architectures to capture different characteristics of faces, and conduct score-level fusion to further boost the performance. Second, the training set we use contains more faces with diverse variations. In order to disentangle the effect of using different training sets, we retrained the Center-Face model using the curated MS-Celeb-1M dataset. The results for the IJB-B and IJB-C datasets are shown in Table 2 and Table 3. We can see significant improvements in performance compared to the pretrained model, but the proposed fusion method still outperforms the retrained model significantly.
TABLE 4. Performance variations (TAR@FAR on IJB-B using CNN-2) as the detection score threshold thr varies over {0.70, 0.75, 0.80, 0.85} (left) and the reweighting coefficient α varies over {0.75, 0.80, 0.85, 0.90, 0.95} (right).
Fig. 6. ROC curves when the yaw difference between two face images changes (a) and when the absolute yaw angle of the faces changes (b), on IJB-B. The range is from 0° to 90° because we average the features of the original face and its mirrored image as the final face representation. The absolute yaw angles are computed by averaging over the two faces. The dashed line represents the results for the overall protocol.

Fig. 7. ROC curves when the roll angle difference between two face images changes, on IJB-B (a) and IJB-C (b). The range is from 0° to 180°. The dashed line represents the results for the overall protocol.

When we perform the detection-score based fusion of Eq. (2), there are two parameters: thr and α. Here, we present the results of an ablation study on the sensitivity of these two parameters; the performance for different parameter settings is shown in Table 4. All results are reported on the IJB-B dataset using CNN-2. We observe that the threshold thr does not severely affect the performance. In contrast, decreasing the reweighting coefficient α can significantly improve the performance at low FARs ($10^{-5}$, $10^{-6}$) while slightly decreasing the performance at high FARs. This supports the effectiveness of our fusion strategy.

To evaluate the effects of pose variations on face verification performance, the protocol provides yaw and roll angles for each face. Since we use the average of the features of the original face and its mirrored version as the final face representation, the range of yaw is restricted to [0°, 90°] and roll to [0°, 180°]. Based on the yaw difference between a pair of faces, we divide all pairs into four groups: [0°, 15°], [15°, 30°], [30°, 45°], and [45°, 90°]. Similarly, pairs are also divided into four groups based on roll difference: [0°, 15°], [15°, 30°], [30°, 45°], and [45°, 180°].

From Figure 6(a), we observe that the yaw difference between a pair of faces significantly affects face verification performance. For both IJB-B and IJB-C, the ROC curves decrease monotonically as the yaw difference between the two faces increases. Moreover, the performance drops much faster when the yaw difference is larger than 45°. This supports two findings: a) deep face representations remain robust over a wider range of yaw changes than traditional face representations such as LBP [2]; b) state-of-the-art deep networks are still sensitive to large yaw variations (larger than 45°). In addition to the yaw difference between two faces, another key factor that may influence the performance is the absolute yaw value of the faces. In other words, even if the yaw difference between two faces is relatively small (less than 15°), the performance may still be affected when the absolute yaw angles of both faces are large.
In order to separate this factor from yaw difference, we further split the group of yaw difference [0°, 15°] into four subgroups based on their absolute yaw angles: [0°, 15°], [15°, 30°], [30°, 45°], and [45°, 90°], where the angles are computed by averaging the absolute yaw angles of the pair of faces. The ROC curves are shown in Figure 6(b). Similar to the effect of yaw difference, absolute yaw angles larger than 45° cause a large performance drop, while the performance is not affected much when yaw is less than 45°. By comparing Figures 6(a) and 6(b), we have another interesting finding: performance for absolute yaw angles in [45°, 90°] and for yaw difference in [45°, 90°] are comparable, which means that as long as at least one of the two faces has an extreme yaw angle, the performance will be poor. This result demonstrates that face images with extreme yaw angles ([45°, 90°]) are hard for face matching regardless of the yaw difference, because a large part of the facial information is missing.

Figure 7 shows the face verification performance for various roll differences between two faces. We find that performance is better for groups whose roll differences are smaller than 45°. This result is surprising because, in general, the roll difference should not affect face verification performance, since 2D face alignment is performed before face matching to normalize all faces to the same roll angle. However, the performance drop with increasing roll difference shows that facial landmarks may not be accurate, so that faces are not normalized as expected when the roll angle is large.

In the gender evaluation protocol, female pairs are assigned label 0 and male pairs label 1. In order to obtain valid ROC curves, the protocol does not consider the group in which the two faces of a pair have different genders: two images from different genders cannot form a positive pair, and an ROC curve cannot be drawn without positive pairs. From Figure 8(a), it can be observed that the performance for males is much better than that for females on both the IJB-B and IJB-C datasets. A possible explanation for this result is that women's faces are more often occluded by long hair and their facial appearance is changed by makeup.
The 1:1 covariate protocol labels the test pairs into seven categories based on their age distribution. Ages in [0, 19], [20, 34], [35, 49], [50, 64], and 65+ are labeled as 1, 2, 3, 4, and 5, respectively. Pairs whose two faces have different ages are labeled as -1. Label 0 represents the group of unknown ages. Results for the IJB-B dataset are shown in Figure 8(b). Due to space limitations, we do not include the IJB-C plots here, because they show results similar to IJB-B. The dashed line represents performance for the overall protocol, while the solid lines show curves for the different age groups. Performance goes up as the age group changes from 1 to 3. In contrast, the curves begin to fall from group 3 to group 5. This means the middle-age group (group 3) is the easiest one to recognize, while both very young and very old subjects are challenging for face verification. One possible explanation is that newborn babies all look very similar, and their unique facial features begin to emerge as they grow. However, when people become older, some features common to old people, like wrinkles and sagging skin, impair the uniqueness of their facial characteristics, which may make older subjects harder to distinguish. In addition, we notice that age group -1 (the ages of the two images are different) performs similarly to the overall protocol, which means cross-age face verification is as hard as the general case. Nonetheless, this dataset does not fully explore the difficulty of cross-age face verification, because the IJB-B and IJB-C datasets do not have images of the same person across large age gaps.

For skin tone, the protocol defines six classes: (1) light pink, (2) light yellow, (3) medium pink/brown, (4) medium yellow/brown, (5) medium dark brown, and (6) dark brown. Similar to gender, skin tone does not contain group -1, because two images with different skin tones cannot form a positive pair. From Figure 9, we observe that the performance of the different skin tone groups shows different trends on IJB-B and IJB-C. For IJB-B, the ROC curves for the different groups are well separated. Since skin tone changes from light to dark from group 1 to group 6, the general trend is that performance falls as skin tone becomes darker. However, a counterexample is skin tone group 6 (darkest), which performs better than groups 2 to 5. In contrast, for IJB-C, except for groups 1 and 5, which follow the same trends as on IJB-B, the performance for the other skin tone groups is very close. Thus, we can only conclude that skin tone group 1 is the easiest and skin tone group 5 is the hardest for face verification. However, since defining or recognizing skin tones is sometimes ambiguous, it is hard to decide from these results alone which skin tone is easier for face verification.
To evaluate the effects of occlusion, the protocol tests three types of visibility for different facial parts: eye visibility, mouth and nose visibility, and forehead visibility. Label 0 (1) means the part is invisible (visible) in both images, and label -1 means the part is visible in one image but not the other. The ROC curves for mouth and nose visibility and for forehead visibility on the IJB-B dataset are presented in Figures 10(a) and 10(b), respectively. Due to space limitations, we do not include the IJB-C plots here, because they show results similar to IJB-B. In Figures 10(a) and 10(b), similar results are seen for mouth/nose and forehead visibility: classes -1 and 0 have comparable performance but are worse than class 1, which means that performance falls by a large margin if the nose, mouth or forehead is occluded in at least one of the images. This result indicates the importance of the visibility of key facial parts for recognizing faces.
There are four classes for evaluation in the facial hair protocol: class 0 represents no facial hair, while classes 1, 2, and 3 represent moustache, goatee and beard, respectively. Label -1 means the facial hair classes are different for the two images. Some sample images for moustache, goatee and beard are shown in Figure 12. From Figure 11(a), we observe that performance is not very sensitive to facial hair changes. This result suggests that facial hair does not change the key features of faces, and state-of-the-art deep models can handle most facial hair variations.
The last covariate we evaluate in the protocol is indoor/outdoor. Outdoor is labeled as 0 and indoor as 1. Label -1 means one image is taken indoors and the other outdoors.
Fig. 8. ROC curves for different genders (a) and for the case of age variation (b), on IJB-B. For gender, female pairs are labeled as 0 and male pairs as 1. The dashed line represents the results for the overall protocol. For the age covariate, ages in [0, 19], [20, 34], [35, 49], [50, 64], and 65+ are labeled as 1, 2, 3, 4, and 5, respectively. Ages that are different for the two images in a pair are labeled as -1, and label 0 represents unknown ages.

Fig. 9. ROC curves with change in skin tone, on IJB-B (a) and IJB-C (b). The dashed line represents performance for the overall protocol, while solid lines are curves for different skin tones. Light pink, light yellow, medium pink/brown, medium yellow/brown, medium-dark brown and dark brown are labeled as 1, 2, 3, 4, 5 and 6, respectively.
Fig. 10. ROC curves corresponding to nose/mouth (a) and forehead (b) visibility for the IJB-B dataset.
Fig. 11. ROC curves for the case of facial hair variation (a) and for the indoor/outdoor covariate (b), on IJB-B. For facial hair, label 0 represents no facial hair, while labels 1, 2, and 3 represent moustache, goatee and beard, respectively. For indoor/outdoor, outdoor is labeled as 0 and indoor as 1. Label -1 means one image is taken indoors and the other outdoors.

Fig. 12. Sample images for different facial hair types. The four images correspond to no facial hair (0), moustache (1), goatee (2) and beard (3), respectively.

Performance is shown in Figure 11(b). We can see that the performance of class 1 is much better than that of classes 0 and -1. This implies that indoor images are easier for face verification. There are two possible reasons for this result. First, outdoor images can easily be over-exposed and lose significant facial information. Second, outdoor images are often taken with hand-held cameras while people are walking. In contrast, indoor images are usually captured using a tripod, or at least without much motion, so the image quality of indoor images is often better than that of outdoor images.
In the unconstrained environment, multiple face covariates are often correlated with each other, which may affect the performance. It has been found that some covariates may show different trends on face verification performance when other covariates are considered together [8], [37]. To study the correlation among the different covariates, we chose four pairs of related covariates and evaluated their interactive effects: gender and age, gender and skin tone, indoor/outdoor and nose mouth/forehead visibility, and indoor/outdoor and yaw angle difference. Due to space limitations, all experimental results are reported for the IJB-B dataset.
In order to show how gender and age influence each other, we draw the ROC curves in Figure 13(a) for each possible combination of gender and age group. Different age groups are represented using different colors, and males/females are shown as solid/dashed lines. First, we fix the gender factor and compare the performance of different age groups for males or for females. We see that males and females show very different trends in age group effects. More specifically, for males the middle age group [35, 49] performs best, and the performance for the older age groups [50, 64] and 65+ decreases. In contrast, for females the performance keeps increasing as the age group gets older.

Alternatively, we can fix the age group factor and compare the performance of males and females within each age group. As observed in Section 5.4, males in general achieve superior performance to females. However, this finding does not hold for age groups [50, 64] and 65+. For age group [50, 64], males and females perform comparably, while for the 65+ age group females outperform males.

We repeated the procedure discussed above to analyze the combination of gender and skin tone. The ROC curves are shown in Figure 13(b). For skin tone groups 4 and 6, the performance for females is better than that for males, while males perform better for groups 1, 2 and 5. For skin tone group 3, males and females perform similarly. This result shows that the combinations of gender and skin tone do not exhibit clear trends, and the performance is dataset-dependent.
In addition to the demographic covariates, we are also interested in the mixed effects of covariates related to extrinsic factors. Figure 14 shows the performance for different combinations of indoor/outdoor and nose mouth/forehead visibility. As we have already seen, visible nose mouth/forehead and indoor capture are individually favorable for performance. However, these two factors do not have independent impacts. From Figure 14, we find that performance is good only when the nose, mouth, or forehead is visible and the images are taken indoors. Either occlusion or outdoor capture can deteriorate the performance.
Fig. 13. ROC curves corresponding to age and gender changes (left), and skin tone and gender changes (right), on IJB-B. Colored lines represent different age groups and skin tones, where small numbers represent young ages and light skin tones. Females are shown as dashed lines and solid lines represent males.
Fig. 14. ROC curves corresponding to nose and mouth visibility (a) and forehead visibility (b) combined with indoor/outdoor, on IJB-B.
Fig. 15. ROC curves corresponding to yaw difference and indoor/outdoor. Outdoor is shown as dashed lines and solid lines represent indoor.
The last combination we consider is indoor/outdoor and yaw angle difference. The ROC curves are presented in Figure 15. We notice that when the indoor/outdoor factor is fixed, the performance for smaller yaw angle differences is always better. On the other hand, when the yaw angle difference is fixed, indoor faces always outperform outdoor faces. This result demonstrates that yaw angle difference and indoor/outdoor affect face verification performance independently: changing either of the two factors affects the performance.
Since pose variation is a key challenge for face verification, we also used the Celebrities in Frontal-Profile (CFP) dataset to further investigate the underlying effects of extreme pose variations on unconstrained face verification performance. The CFP dataset consists of 7,000 still images from 500 subjects, with 14 images per subject. For each subject, it has 10 images in frontal pose and 4 images in profile pose. To evaluate the performance for different poses, the protocol contains two settings: frontal-to-frontal (FF) and frontal-to-profile (FP) face verification.
OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13
Fig. 16. Sample images for CFP datasets. frontal-to-frontal setting, two test images are both in frontal poseand in frontal-to-profile setting, a test pair includes one frontal faceand one profile face. Each setting divides the whole dataset intoten splits and each split consists of 350 positive and 350 negativepairs. Some sample images are shown in Figure 16.
We follow the performance evaluation metrics used in [43] and report three numbers for each setting: area under the curve (AUC), equal error rate (EER), and accuracy. AUC measures the area under the ROC curve and ranges from 0 to 1, with higher values indicating better performance. EER is the operating point at which the false accept rate equals the false reject rate; it also ranges from 0 to 1, with lower values indicating better performance. To obtain accuracy, we classify all pairs using a threshold and compute the classification accuracy; the threshold is chosen as the value that yields the highest classification accuracy on the cross-validation set.
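As a concrete reference, the following sketch computes the three reported metrics from a set of pair scores and ground-truth labels. It is a minimal illustration under the definitions above, not the authors' evaluation code; the function name and arguments are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def cfp_metrics(scores, labels, threshold=None):
    """Accuracy, EER, and AUC for one split of verification pairs.

    scores: similarity score per pair; labels: 1 genuine, 0 impostor.
    """
    fpr, tpr, thresholds = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # EER: operating point where the false accept rate (FPR)
    # equals the false reject rate (1 - TPR).
    idx = np.argmin(np.abs(fpr - (1.0 - tpr)))
    eer = (fpr[idx] + (1.0 - tpr[idx])) / 2.0
    # Accuracy at a decision threshold; if none is given, pick the
    # threshold maximizing accuracy (the protocol selects it on the
    # cross-validation splits, not on the test split).
    if threshold is None:
        accs = [np.mean((scores >= t).astype(int) == labels) for t in thresholds]
        threshold = thresholds[int(np.argmax(accs))]
    accuracy = np.mean((scores >= threshold).astype(int) == labels)
    return accuracy, eer, roc_auc
```

In the actual protocol, the threshold would be selected on the training splits and the three numbers averaged over the ten splits.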
The experimental results for the frontal-to-frontal and frontal-to-profile protocols are summarized in Table 5. The CNN-1 to CNN-4 results are obtained using the same models and the same processing steps as in the IJB-B and IJB-C experiments. For the fusion, since all detection scores for the images in the CFP dataset are near 1, we simply average the similarity scores of CNN-1 through CNN-4. The deep-features and human results are cited directly from [43]. All performance numbers are averaged over the ten splits.

For the frontal-to-frontal setting, CNN-1 to CNN-4 all outperform both the deep-features method and human performance in [43]. CNN-2S and CNN-3 perform similarly, and their performance is slightly better than that of CNN-1 and CNN-4. Since the performance of CNN-2S and CNN-3 has already saturated, the fusion results do not change much compared to CNN-2S or CNN-3 alone. For the frontal-to-profile setting, the algorithms begin to show significant differences in performance. CNN-1 results are slightly worse than human performance but 2% better than CNN-4. On the other hand, CNN-2S and CNN-3 both surpass human performance by more than 2%. Another interesting finding is the comparison between the frontal-to-frontal and frontal-to-profile settings. While performance in the frontal-to-frontal protocol does not vary much across algorithms, the performance drop from frontal-to-frontal to frontal-to-profile differs considerably among the compared algorithms. Generally speaking, better algorithms are more robust to extreme yaw variations and suffer smaller performance degradation in the frontal-to-profile setting. In particular, CNN-2S has the smallest performance drop, 1.6%, from frontal-to-frontal to frontal-to-profile, which is similar to human performance. However, comparing these results with Section 5.3, even the best results are still severely affected by pose variations. This is because the IJB-B and IJB-C datasets contain other challenging factors, and pose variations can still degrade performance once combined with these factors. Therefore, even for state-of-the-art face models, there is still room to improve robustness to extreme pose variations.
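The score-level fusion described above is a plain average of per-pair similarity scores. A minimal sketch follows; the model names and score arrays are illustrative stand-ins, not the actual outputs of the implemented networks.

```python
import numpy as np

# Hypothetical similarity scores for one split's 700 pairs from each model.
model_scores = {
    "CNN-1": np.random.rand(700),
    "CNN-2": np.random.rand(700),
    "CNN-3": np.random.rand(700),
    "CNN-4": np.random.rand(700),
}

# Score-level fusion: average the per-pair similarity scores across models.
fused = np.mean(np.stack(list(model_scores.values())), axis=0)
```

On datasets where face detection confidence varies, the same average could be weighted by detection scores; on CFP those scores are near 1, so a plain mean suffices.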
TABLE 5
Performance comparison for different methods on the CFP dataset. Our fusion results are generated by averaging the four deep models.

Method               Frontal-to-Frontal                          Frontal-to-Profile
                     Accuracy      EER           AUC             Accuracy      EER           AUC
Deep features [43]   0.964(0.007)  0.035(0.007)  0.994(0.003)    0.849(0.018)  0.150(0.020)  0.930(0.016)
Human [43]           0.962(0.007)  0.053(0.018)  0.982(0.011)    0.946(0.011)  0.050(0.011)  0.989(0.005)
CNN-1                0.988(0.002)  0.012(0.004)  0.999(0.001)    0.938(0.012)  0.062(0.013)  0.986(0.005)
CNN-2S               –             –             –               –             –             –
CNN-3                0.994(0.004)  0.006(0.005)  –               –             –             –

CONCLUSION AND FUTURE WORK

In this paper, we conducted comprehensive experiments to study the effects of covariates on unconstrained face verification performance. We also curated the training data by exploiting gender information and achieved improved performance. Experimental results on the overall protocols of the IJB-B and IJB-C covariate verification tasks show the outstanding performance of the five implemented deep models and their score-level fusion. This demonstrates that these deep models are more robust to different variations of faces than previous methods. However, when we focus on each specific covariate, we find that many covariates still significantly affect verification performance. Pose variations and occlusions are the top confounding factors and can cause performance to drop by large margins. In addition, indoor performance is much better than outdoor performance. On the other hand, the difficulty of unconstrained face verification varies significantly across demographic groups. Age, gender, and skin tone all have shown impacts on performance. Specifically, males are easier to verify than females, and old subjects generally perform better than young ones. For skin tone, light pink achieved the best performance while medium-dark brown performed worst. However, since IJB-B and IJB-C show very different tendencies across skin tone groups, we may not be able to draw a clear conclusion on its effects.

Most of the findings discussed above confirm the findings of previous studies. However, there are also some new findings that were rarely mentioned by other studies or are somewhat surprising. First, we found that verification performance does not increase monotonically as subjects get older; instead, performance begins to drop for the age groups of 50 and above. This result differs from most studies, which claim that older subjects are always easier to recognize. However, since most other studies did not have a sufficient number of older subjects to analyze, their results still make sense, because the middle age group performs better than children and teenagers. Second, we observed that extreme roll angle differences between faces still affect performance substantially. This result is unexpected, as roll variations should be eliminated by face alignment. Therefore, we conclude that face alignment performance needs to improve for faces at extreme roll angles.

Finally, we investigated the mixed effects of multiple covariates. First, males and females show very different trends with respect to age groups: for males, performance first increases and then drops as age goes up, while for females, older age groups always perform better. On the other hand, the interaction of gender and skin tone does not show clear trends. Second, when we consider indoor/outdoor and occlusion together, we find that indoor capture and nose-mouth/forehead visibility must be satisfied simultaneously to achieve good performance. However, indoor/outdoor and yaw angle difference affect the performance independently.

Some of the results from our studies suggest several promising research directions. First, apart from the yaw problem, we should also consider the influence of roll when designing face verification systems. This can be done through either improved face alignment or more robust feature extraction models. Second, since gender, age, and skin tone all have a significant impact on performance, we may collect training sets more carefully to improve the performance on certain demographic groups. Third, we showed preliminary results on how to use gender estimation for training data curation. Other covariates, such as race, may be used in a similar way. Moreover, we may combine covariates with clustering methods for improved curation performance.

ACKNOWLEDGMENT
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

REFERENCES

[1] S. H. Abdurrahim, S. A. Samad, and A. B. Huddin. Review on the effects of age, gender, and race demographics on automatic face recognition. The Visual Computer, pages 1–14, 2017.
[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
[3] A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa. The do's and don'ts for CNN-based face verification. arXiv preprint arXiv:1705.07426, 2017.
[4] A. Bansal, A. Nanduri, C. Castillo, R. Ranjan, and R. Chellappa. UMDFaces: An annotated face dataset for training deep networks. arXiv preprint arXiv:1611.01484, 2016.
[5] Y. Bar-Haim, T. Saidel, and G. Yovel. The role of skin colour in face recognition. Perception, 38(1):145–148, 2009.
[6] L. Best-Rowden and A. K. Jain. Longitudinal study of automatic face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):148–162, 2018.
[7] J. R. Beveridge, G. H. Givens, P. J. Phillips, and B. A. Draper. Factors that influence algorithm performance in the face recognition grand challenge. Computer Vision and Image Understanding, 113(6):750–762, 2009.
[8] J. R. Beveridge, G. H. Givens, P. J. Phillips, B. A. Draper, D. S. Bolme, and Y. M. Lui. FRVT 2006: Quo vadis face quality. Image and Vision Computing, 28(5):732–743, 2010.
[9] J. R. Beveridge, G. H. Givens, P. J. Phillips, B. A. Draper, and Y. M. Lui. Focus on quality, predicting FRVT 2006 performance. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, pages 1–8. IEEE, 2008.
[10] J.-C. Chen, R. Ranjan, A. Kumar, C.-H. Chen, V. M. Patel, and R. Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 118–126, 2015.
[11] C. Ding and D. Tao. A comprehensive survey on pose-invariant face recognition. ACM Transactions on Intelligent Systems and Technology (TIST), 7(3):37, 2016.
[12] L. Du and H. Ling. Cross-age face verification by coordinating with cross-face age verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329–2338, 2015.
[13] G. Givens, J. R. Beveridge, B. A. Draper, and D. Bolme. A statistical assessment of subject factors in the PCA recognition of human faces. In Computer Vision and Pattern Recognition Workshop, 2003. CVPRW'03. Conference on, volume 8, pages 96–96. IEEE, 2003.
[14] G. Givens, J. R. Beveridge, B. A. Draper, P. Grother, and P. J. Phillips. How features of the human face affect recognition: a statistical comparison of three face recognition algorithms. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.
[15] G. H. Givens, J. R. Beveridge, P. J. Phillips, B. Draper, Y. M. Lui, and D. Bolme. Introduction to face recognition and evaluation of algorithm performance. Computational Statistics & Data Analysis, 67:236–247, 2013.
[16] K. Grm, V. Štruc, A. Artiges, M. Caron, and H. K. Ekenel. Strengths and weaknesses of deep learning models for face recognition against image degradations. IET Biometrics, 7(1):81–89, 2017.
[17] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[18] R. Gross, J. Shi, and J. Cohn. Quo vadis face recognition. In Third Workshop on Empirical Evaluation Methods in Computer Vision, pages 119–132, 2001.
[19] P. J. Grother, G. W. Quinn, and P. J. Phillips. Report on the evaluation of 2D still-image face recognition algorithms. NIST Interagency Report 7709, page 106, 2010.
[20] G. Guo, G. Mu, and K. Ricanek. Cross-age face recognition on a very large database: The performance versus age intervals and improvement using soft biometric traits. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3392–3395. IEEE, 2010.
[21] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In European Conference on Computer Vision. Springer, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] W. H. Ho, P. Watters, and D. Verity. Are younger people more difficult to identify or just a peer-to-peer effect. In International Conference on Computer Analysis of Images and Patterns, pages 351–359. Springer, 2007.
[24] G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales, N. M. Robertson, and Y. Yang. Attribute-enhanced face recognition with neural tensor fusion networks. In ICCV, pages 3764–3773. IEEE Computer Society, 2017.
[25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[26] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, and S. Shan. Maximal likelihood correspondence estimation for face recognition across pose. IEEE Transactions on Image Processing, 23(10):4587–4600, 2014.
[27] W.-A. Lin, J.-C. Chen, and R. Chellappa. A proximity-aware hierarchical clustering of faces. arXiv preprint arXiv:1703.04835, 2017.
[28] H. Ling, S. Soatto, N. Ramanathan, and D. W. Jacobs. A study of face recognition as people age. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[29] Z. Liu and S. Sarkar. Outdoor recognition at a distance by fusing gait and face. Image and Vision Computing, 25(6):817–832, 2007.
[30] Y. M. Lui, D. Bolme, B. A. Draper, J. R. Beveridge, G. Givens, and P. J. Phillips. A meta-analysis of face recognition covariates. In Biometrics: Theory, Applications, and Systems, 2009. BTAS'09. IEEE 3rd International Conference on, pages 1–8. IEEE, 2009.
[31] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4838–4846, 2016.
[32] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In Proceedings of the IAPR International Conference on Biometrics, 2018.
[33] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] J. Paone, S. Biswas, G. Aggarwal, and P. Flynn. Difficult imaging covariates or difficult subjects? An empirical investigation. In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–8. IEEE, 2011.
[35] C. J. Parde, C. Castillo, M. Q. Hill, Y. I. Colon, S. Sankaranarayanan, J.-C. Chen, and A. J. O'Toole. Face and image representation in deep CNN features. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 673–680. IEEE, 2017.
[36] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[37] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and M. Bone. Evaluation report, 2003.
[38] N. Ramanathan and R. Chellappa. Face verification across age progression. IEEE Transactions on Image Processing, 15(11):3349–3361, 2006.
[39] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[40] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 17–24. IEEE, 2017.
[41] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on, pages 1–8. IEEE, 2016.
[42] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[43] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–9. IEEE, 2016.
[44] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 53–58. IEEE, 2002.
[45] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[46] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
[47] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015.
[48] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284. AAAI Press, 2017.
[49] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
[50] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[51] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother. IARPA Janus Benchmark-B face dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[52] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys (CSUR), 35(4):399–458, 2003.