On measuring the iconicity of a face
Prithviraj Dhar, Carlos D. Castillo, Rama Chellappa
University of Maryland, College Park
{prithvi, carlos, rama}@umiacs.umd.edu

Abstract
For a given identity in a face dataset, there are certain iconic images which are more representative of the subject than others. In this paper, we explore the problem of computing the iconicity of a face. The premise of the proposed approach is as follows: for an identity containing a mixture of iconic and non-iconic images, if a given face cannot be successfully matched with any other face of the same identity, then the iconicity of the face image is low. Using this information, we train a Siamese Multi-Layer Perceptron network, such that each of its twins predicts an iconicity score for one feature of the input image feature pair. We observe the variation of the obtained scores with respect to covariates such as blur, yaw, pitch, roll and occlusion to demonstrate that they effectively predict the quality of the image, and compare them with other existing metrics. Furthermore, we use these scores to weight features for template-based face verification and compare this with media averaging of features.
1. Introduction
What makes Brad Pitt look like Brad Pitt? A clean frontal image of the actor might represent him better than an image where he is wearing sunglasses and a large fedora (Figure 1). In this case, the former can be considered an iconic image and the latter a non-iconic one. Predicting face iconicity is a useful task in facial image analysis. This problem is difficult because 'iconicity' is subjective and depends on the existing images of a subject. Most face recognition and verification systems are known to perform well for iconic images captured in constrained environments. However, to measure the performance of such systems in real-life scenarios, they should be evaluated on unconstrained faces. Moreover, the performance of a system can be accurately measured by taking into consideration the difficulty (based on iconicity) of the test dataset. So, computing the iconicity of a given facial image is useful for properly evaluating face verification and recognition systems.
Figure 1: (a) Iconic and (b) non-iconic image of Brad Pitt. Our approach assigns iconicity scores of 0.84 and 0.33 to (a) and (b), respectively.

In [4], an iconic image for an object is defined as an image with a large, clearly delineated instance of the object in a characteristic view. The authors showed that iconic images can be identified rather accurately in natural datasets by segmenting images with a procedure that identifies foreground pixels. But this does not translate well to identifying iconic facial images. One unsupervised method to segregate iconic and non-iconic face images would be to perform global clustering across all identities and conclude that the images not present in the appropriate identity cluster are non-iconic. However, in this setting, we cannot compute the iconicity of any unseen identity. Moreover, in a dataset containing millions of images (such as UMDFaces or MS-Celeb-1M), it can be very cumbersome to perform clustering. Thus, it would be very helpful to have a method which can assign an iconicity measure to any unseen image without any additional information.

An iconic image can be expected to be a high-quality face image. Thus, computing iconicity can also help us develop a notion of face quality. In [5], the authors propose a framework to regress the quality scores of a model to human quality values. Although this method can directly estimate the face quality of unseen images, its training phase is not scalable, as it is expensive to ask humans to provide ground truth for a regression model. Hence, it is important to design an approach where a blind quality/iconicity prediction model can be trained only by optimizing an objective that depends solely on the inherent properties of a given image.

Figure 2: We propose an approach to map the feature vector of a facial image to a space where its length is a function of its iconicity.

Similarity of an image with another is one such inherent property. In this work, we define iconicity as the verifiability of an image and propose a technique that uses pairwise similarity to train a model to predict the verifiability of an image. This model is built as a Siamese Multi-Layer Perceptron (MLP) network, which is trained using deep face descriptors extracted from a recognition network and optimized using their similarity. It should be underscored that the 'verifiability' of an image (also called its iconicity) is defined as the distinct suitability of a facial image feature to be matched with any other image feature of the same identity in the entire dataset. The training of the Siamese MLP does not require any iconicity-based supervision, and only needs identity labels. This is important, as there exist several face datasets with identity labels but no datasets with iconicity-based annotation.

During training, the model learns to map a given feature to an iconicity score by learning its interaction with features of the same and different identities. Once the model learns this mapping, it can be used to predict the iconicity of images of unseen identities. After the model is trained, the measure of iconicity is unary. At test time, we do not require any explicit information such as identity labels, global knowledge of the dataset, or any predefined reference images and attributes. We can also generate different sets of iconicity scores corresponding to different sets of descriptors used for training the Siamese MLP network.

The contributions of our work are as follows: (1) We propose a novel method to relate the iconicity of a face image to the verifiability of its descriptor.
(2) We propose a simple Siamese Multi-Layer Perceptron architecture to estimate the iconicity of a given facial image feature, without any iconicity-based supervision. (3) We demonstrate that defining iconicity in the aforementioned manner correlates well with factors that affect the visual quality of an image. (4) We establish a use case for our iconicity scores by using them to pool features in template-based face verification, and obtain results comparable to the state of the art.

The paper is organized as follows: previous work in this area is summarized in Section 2. We present our approach in Section 3 and describe the experiments performed in Section 4. Experimental results are reported in Section 5.
2. Related work
Face iconicity (or image iconicity in general) has not been explored widely in previous research. In [3], the authors use the compositionality of an image, and its similarity with images of different object categories, to select images that are highly representative of the class. This is followed by clustering to filter out the most iconic images of an object category. But the specified method requires explicit information about categories at test time.

As mentioned earlier, face iconicity is related to face quality. We believe that an ideal face iconicity metric should correlate well with existing image quality metrics. Before we proceed, it is important to highlight that image quality assessment (IQA) and face image quality prediction are different tasks. For instance, a profile face image can be of high quality, yet it might not represent the identity as well as a frontal image can. Also, it should be noted that IQA is usually defined in the context of image compression. However, due to the shared attributes of face quality prediction and general IQA, we discuss the relevant work in both of these areas. Existing literature on (facial) image quality can be divided into three categories, which are discussed below:
Techniques computing a specific attribute to measure quality: Previous works ([1], [25], [21]) in this field have defined quality on the basis of a specific attribute of the facial image, such as pose, blur, etc. For example, the authors of [1] defined quality on the basis of contrast, brightness, focus, illumination and sharpness. The authors in [12] propose the BRISQUE algorithm based on the 'naturalness' of the image, defined as the deviation of the normalized luminance distribution from its Gaussian estimate. However, these approaches use a specific attribute or reference images to develop a notion of face quality, and hence cannot be scaled to larger datasets with more variation and breadth.
Techniques which utilize explicit manually extracted quality-specific information:
Several works, and most CNN/deep network-based quality assessment frameworks, require supervision with respect to quality scores during training. However, the size of such datasets is limited, and the construction of such databases is time-consuming and expensive. Moreover, no such datasets exist for face quality. For example, in [8], the authors proposed a no-reference deep CNN architecture which is trained on datasets (such as LIVE [22]) that supply image degradation levels as ground truth. Similarly, the authors proposed a learning-to-rank technique in [6], wherein weights (which indicate quality) are learned for image features according to their respective datasets, such that the predefined ranking of the quality of images in the datasets considered is maintained. However, the ranking of the dataset quality needs to be explicit in this work. In [10], the authors used a Siamese network to rank image features according to their quality. However, their approach explicitly required quality-based ranks for training. Moreover, as their framework is not specifically defined for faces, the feature similarity cannot be indicative of the pair quality, as there is no concept of identities.

Figure 3: Overview of our proposed approach. During training, we compute the loss using the cosine similarity cos α and the pair label y, which guide the growth/decay of r(f₁) and r(f₂). f₁ and f₂ are extracted from a recognition network. During testing, we use one of the trained twins.

Descriptor/image-to-score techniques:
One of the more recent works [16] used the scores assigned to a face by a face detector (FD scores) as a measure of facial image quality. In this work, it was established that weighting the features of the network proposed in [17] with these quality scores helps to improve the performance of existing template-based face verification systems. Another interesting work in this area is [13], where it was proposed that the norm of a facial image feature can be used as a quality measure of the face. It was demonstrated that a lower norm value is associated with low-quality images. However, this is not applicable to normalized image features (such as those obtained in [17], [24], [23], etc., where the network normalizes the feature as part of the training process). To the best of our knowledge, [13] and [16] are the only works on face quality prediction (as opposed to general IQA) where the computation of the quality scores is completely independent of reference images/attributes and/or does not rely on quality-level ground truth during training. Hence, we present a technique in this category which directly predicts the iconicity of an unseen facial image given its feature descriptor. This removes the dependence of the predictor on reference attributes or images, making the method scalable and generalizable.
3. Our Approach
We view iconicity as one of the indicators of the quality of an image. We hypothesize that for an identity consisting of a mixture of iconic and non-iconic images, if a facial image cannot be matched successfully with images of the same identity, then it is a non-iconic image, i.e., the verifiability of that image is low. This information from the training dataset is indirectly used to optimize an objective function, which is designed to enable the network to estimate the verifiability of a feature. Once the model learns to map a feature to its verifiability, it can predict the same for any feature, irrespective of identity, during evaluation. We use deep feature representations of faces instead of the raw image and propose an approach (depicted in Figure 3) to predict the verifiability of any given image feature, without using reference images or attributes.
On a hypersphere of images, the cosine similarity of an image feature pair provides an estimate of the angular separation between these images. However, on this hypersphere, the notion of the length of a feature vector is lost, as the cosine similarity is obtained by normalizing the feature vectors followed by their inner product. We present an approach which maps this representation to another hypersphere (shown in Figure 2) where the length of the feature vector $f$ represents the iconicity of the feature and is given by a function $r(f)$, which is the output of our model. For this mapping we normalize the feature to $\frac{f}{\|f\|}$ and represent it as $r(f) \cdot \frac{f}{\|f\|}$ in the new hyperspace, where $r(f)$ represents the iconicity of the feature. Thus, the iconicity determines the length of the feature from the origin of the hypersphere. In order to learn $r(f)$, we use a pairwise learning technique where we optimize $\langle f_1, f_2 \rangle_r$, the dot product of a feature pair in the new ($r(\cdot)$) space:

$$\langle f_1, f_2 \rangle_r = \left( r(f_1) \frac{f_1}{\|f_1\|} \right)^T \left( r(f_2) \frac{f_2}{\|f_2\|} \right) = r(f_1)\, r(f_2) \cos\alpha$$

To optimize this inner product in the new space, we formulate it as a new similarity measure which also takes into account the iconicity of the individual features. So, we define the objective function for training the model as:

$$L(f_1, f_2) = \max\left(0,\; y\,(\Delta - r(f_1)\, r(f_2) \cos\alpha)\right) \qquad (1)$$

where $\Delta$ ($> 0$) represents the margin, which is a hyperparameter. Specifically, we want the angle $\alpha$ between two image features $f_1$ and $f_2$, together with the label $y$ associated with the pair (+1 for a positive pair, i.e., both images from the same identity, and −1 for a negative pair, i.e., the images belong to different identities), to guide the growth/decay of $r(f_1)$ and $r(f_2)$. We build a Siamese MLP network to learn the function $r(\cdot)$. During training, it takes in a pair of features $f_1$ and $f_2$ and predicts their iconicities, i.e., $r(f_1)$ and $r(f_2)$.

Type | cos α | y(Δ − r(f₁)r(f₂) cos α) | Effect on r(f₁)r(f₂) for optimization
I | Relatively low (< 0 — non-iconic) | Relatively high | Decrement
II (Disguise) | Relatively high (> 0 — non-iconic) | Relatively high | Decrement
III | Relatively high (> 0 — iconic) | Relatively low | Increment
IV | Relatively low (< 0 — iconic) | Relatively low | Increment

Table 1: Tentative effect of cos α on the product r(f₁)r(f₂)

Initially, $r(f)$ is a random scalar, but as the network is trained, for a given feature $f$, $r(f)$ is optimized such that it captures the verifiability of $f$ with respect to the entire dataset, taking into consideration its associated pair labels and its interaction with other features. Hence, the model is trained to predict the verifiability of any given $f$ using angular separation and pair labels. The numerical interpretation of our loss function is provided below (a minimal code sketch of this loss is given after the pair-type analysis).

Without loss of generality, we can assume the presence of the following four types of pairs in the training dataset:
Type-I: At least one unclean image in a positive pair. Similarity scores of such pairs are less than zero, even though the associated images are of the same identity.
Type-II: At least one unclean image in a negative pair. Similarity scores of such pairs are positive, even though the associated images belong to different identities, representing disguise.
Type-III: Both images clean in a positive pair. Similarity scores of such pairs are positive, as expected.
Type-IV: Both images clean in a negative pair. Similarity scores of such pairs are negative, as expected.

Table 1 illustrates the effect of α and the resulting effect of the loss function on the product r(f₁)r(f₂). It can be noticed that the model is inclined to decrease this product if the pair contains at least one non-iconic image.

As inferred above, the product r(f₁)r(f₂) is decreased by the model if the pair contains at least one non-iconic image. If the training dataset consists of l iconic and m non-iconic images, then a given non-iconic image feature f can be associated with at most l + m − 1 pairs, each of them belonging to either Type I or Type II. Similarly, a given iconic image can be associated with only m pairs belonging to Type I or II. Hence, during training, a product r(f₁)r(f₂) involving a given non-iconic image is penalized more often than one involving a given iconic image, since m ≪ l + m − 1, i.e., the dataset consists of a mixture of iconic and non-iconic images. From this, we cannot directly deduce that the score of the non-iconic feature f would be penalized more, because the product can also be decreased by penalizing the score of an iconic image while increasing that of the non-iconic image. However, as presented in Table 1, the product should be maximized when an iconic pair is encountered during training, to decrease the loss. So, to penalize a product involving a non-iconic image feature, the score of the non-iconic image feature needs to be decreased. Therefore, it can be concluded that since a given non-iconic image can be associated with more non-iconic pairs (as compared to an iconic image), its score is penalized more than that of a given iconic image.

As explained above, in order to optimize the network, the numbers of iconic images l and non-iconic images m should be chosen such that m < l.
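For concreteness, the loss in Eq. 1 can be written in a few lines. Below is a minimal PyTorch sketch; the function name and the margin value delta=0.5 are our placeholders (the paper tunes Δ on the distribution of training similarity scores, and its exact value is not recoverable here):

```python
import torch
import torch.nn.functional as F

def iconicity_loss(f1, f2, r1, r2, y, delta=0.5):
    """Eq. 1: L(f1, f2) = max(0, y * (delta - r(f1) * r(f2) * cos(alpha))).

    f1, f2 : (B, D) raw feature descriptors from the recognition network.
    r1, r2 : (B,) predicted iconicity scores in (0, 1) from the Siamese twins.
    y      : (B,) pair labels, +1 for same identity, -1 for different identity.
    """
    cos_alpha = F.cosine_similarity(f1, f2, dim=1)  # angular separation term
    inner = r1 * r2 * cos_alpha                     # <f1, f2>_r in the new space
    return torch.clamp(y * (delta - inner), min=0.0).mean()  # hinge over the batch
```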
4. Experiments
As explained in Section 3, we require a training dataset that has a mixture of iconic and non-iconic images for every identity. Hence we use Batch 1 of UMDFaces [2], which satisfies this requirement. This subset of the dataset consists of 175,534 images of 3,674 subjects. For testing, we choose datasets with domains different from that of UMDFaces. Since we show the correlation of our iconicity scores with respect to several parameters (such as blur and pose), we select datasets with appropriate annotations of these parameters.
We train a Siamese MLP network (Figure 4) using pairs from Batch 1 of the UMDFaces dataset. A single twin of the Siamese network accepts a 512-dimensional descriptor and has 4 hidden layers, consisting of 512, 256, 128 and 64 hidden units, respectively.

Figure 4: One of the Siamese MLP twins

The first three hidden layers are followed by SeLU [9] activations. The outputs of these activation layers are fully connected to the subsequent hidden layer. The final hidden layer is then connected to the output node, which is scaled between 0 and 1 using a sigmoid unit. We assign a ground truth y ∈ {+1, −1} to positive and negative pairs, and use Eq. 1 as the loss function. In our experiments, we chose Δ based on the distribution of the similarity scores of the training dataset.

In order to demonstrate the robustness of our model across different sets of features, we perform experiments using features extracted with two separate network architectures. We use features of the architecture proposed in [17], which is trained using the L₂-constrained softmax loss on the MS-Celeb-1M dataset [7]. All features generated using this network have unit norm. We also perform experiments with features learned by the network proposed in [19], which follows the AlexNet architecture and is trained on the CASIA-WebFace dataset [27]. Thus, we train the following two Siamese MLP models, using the two sets of feature descriptors (a code sketch of one twin appears after this list):

a) Model-1: Siamese MLP trained with features of [17].
b) Model-2: Siamese MLP trained with features of [19].

The aforementioned features are used because their corresponding networks have demonstrated high performance on the task of face verification. Also, in all our experiments, features are extracted after computing face coordinates using the all-in-one network described in [18].
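The twin architecture described above (512-d input; hidden layers of 512, 256, 128 and 64 units; SeLU after the first three; sigmoid-scaled scalar output) can be sketched in PyTorch as follows. The class name is ours:

```python
import torch.nn as nn

class IconicityTwin(nn.Module):
    """One twin of the Siamese MLP; both twins share these weights."""
    def __init__(self, in_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.SELU(),
            nn.Linear(512, 256), nn.SELU(),
            nn.Linear(256, 128), nn.SELU(),
            nn.Linear(128, 64),              # final hidden layer (no SeLU stated)
            nn.Linear(64, 1), nn.Sigmoid(),  # output scaled to (0, 1)
        )

    def forward(self, f):
        return self.net(f).squeeze(1)        # (B, 512) -> (B,) iconicity scores
```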
For every epoch, we randomly choose 20,000 positive and 20,000 negative pairs from the dataset, feed the feature pairs to the network, and perform mini-batch gradient descent with a batch size of 256. We train the network for 50 epochs. We train two models, using the features of the networks in [17] and [19], respectively. For testing, we use only one of the twins of the trained model to determine the iconicity of any given image feature.
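Putting the pieces together, a hedged sketch of this training procedure follows, reusing IconicityTwin and iconicity_loss from the earlier sketches. The optimizer choice is our assumption (the paper states mini-batch gradient descent but not a specific optimizer), and random tensors stand in for the UMDFaces feature pairs so the sketch runs as-is:

```python
import torch

model = IconicityTwin()
opt = torch.optim.Adam(model.parameters())   # optimizer choice is an assumption

for epoch in range(50):                      # 50 epochs, as in the text
    for _ in range(40000 // 256):            # ~20000 positive + 20000 negative pairs
        f1 = torch.randn(256, 512)           # stand-ins for recognition features
        f2 = torch.randn(256, 512)
        y = (torch.rand(256) < 0.5).float() * 2 - 1   # +1 same id, -1 different
        r1, r2 = model(f1), model(f2)        # same model applied twice = Siamese twins
        loss = iconicity_loss(f1, f2, r1, r2, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```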
5. Results
An ideal face quality/iconicity metric is expected to be an all-in-one metric which achieves considerable performance on a variety of quality-related tasks and correlates well with human quality opinions. We evaluate face quality metrics on the basis of their performance on the following two tasks.
Relation with factors that affect face quality:
To verify the usefulness of any quality score, we observe its correlation with some of the factors that visibly affect facial image quality: yaw, pitch, roll, occlusion and blur. We also demonstrate that our Siamese models generate iconicity scores that correlate well with the human perception of quality.
Template-based face verification: In [5] and [16], template-based face verification has been used as one of the applications of a face quality measure. Hence, we also evaluate the effectiveness of our iconicity scores for this task. We use iconicity scores as weights to perform quality pooling for template-based face verification, and compare our verification results with those obtained with the media averaging technique for combining features of the same template.

As mentioned in [15], computing quality using a biometric system is a biometric-complete problem, which implies that our computed iconicity score cannot improve the one-to-one verification algorithm. However, in template-based face verification, the objective is to reduce the error rate of the system (and not to remove the weakness of the algorithm). Thus, we expect our scores to help us appropriately weight samples in a template.

To properly evaluate our iconicity scores, a fair comparison with existing face quality metrics as well as a general IQA metric is required. As mentioned in Section 2, FD scores (i.e., scores assigned to a face by a face detector, effectively implying faceness) and the norm of features are the only quality scores which do not require any reference image during training and evaluation. Therefore, we compare the quality scores of Model-1 and Model-2 with: a) face detection scores [16], b) the norm of the features in [19], and c) BRISQUE [12] (an existing general IQA metric), for showing correlation with respect to affecting factors in most of the experiments. We cannot use the norm of [16], as these features have unit norm. For the task of template-based face verification, we compare the verification results obtained after quality pooling with the scores of Model-1 and Model-2 against the FD score and the norm of [19]. Finally, we compare the performance breadth of each of these metrics.
First, we visualize images according to their Model-1 scores. Figure 5 confirms that the scores accurately capture the visual quality of a facial image.
Figure 5: Images with low Model-1 iconicity scores (i.e., r(f)) in (a) Batch 2 of UMDFaces [2] and (b) the IJB-C dataset [11], and images with high Model-1 iconicity scores in (c) Batch 2 of UMDFaces and (d) the IJB-C dataset.

Figure 6: Variation of the scores obtained with Model-1, Model-2, FD Score [16], the norm of [19] and BRISQUE [12] across (a) yaw, (b) pitch and (c) roll, on the IJB-C dataset.
The quality score of a facial image is expected to decrease as the pose of the face becomes extreme. We define pose on the basis of yaw, pitch and roll. For evaluation, we compute the values of these parameters for images in the IJB-C dataset [11] using the all-in-one ConvNet trained in [18]. The IJB-C dataset consists of 3,531 identities with a total of 31,334 still images and 117,542 video frames collected in unconstrained settings. Following this, we bin these values such that each bin contains approximately the same number of images (an equal-frequency binning sketch in code is given at the end of this subsection). For each bin, we compute the average quality score and visualize its variation as the pose becomes extreme. It is worth noting that while analyzing one pose parameter, the other two parameters are constrained to a small range around 0°. It can be confirmed from Figure 6 that the Siamese MLP iconicity scores (Model-1 and Model-2) correlate well with pose variation. It should be emphasized that even though the model was not given any explicit information about facial poses, it is able to capture the relation between pose and quality on a dataset whose domain is completely different from that of the training dataset. On the other hand, BRISQUE [12], being a general IQA metric, shows no correlation with face pose. Also from Figure 6, it is clear that the FD score does not correlate well with roll.

The quality score of a facial image should also decrease as the image gets degraded by blur. To experimentally verify this using the Siamese MLP iconicity scores, we evaluate the scores on images in the WIDER Face dataset [26]. We use the training split of this dataset, as it provides ground-truth annotations for the blur of each image. The training images have been labeled with one of three blur levels: 0 for no blur, 1 for partial blur and 2 for extreme blur. We plot the distribution of images belonging to each of these blur levels with respect to the score obtained from the Siamese MLP. We chose images with no extremes in terms of other parameters (such as occlusion, illumination, expression, etc.), and test the scores obtained by Model-1 and Model-2. It can be seen from Figures 7(a) and (b) that the average quality score (represented by the vertical line) keeps decreasing as the amount of blur increases. In addition, the model is able to assign somewhat separate score distributions to the different levels of blur. Again, the iconicity models were not provided with any explicit information about blur during training. We also provide corresponding plots using three other quality scores: the face detection (FD) score [16], the norm of the feature [19] and BRISQUE [12]. We can infer that while the norm of the feature models blur quite well, the distributions obtained using BRISQUE and FD score are not resolvable.
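As referenced above, the equal-frequency binning used for the pose plots can be sketched with NumPy quantiles. The function name and the bin count are our assumptions; the paper does not state them:

```python
import numpy as np

def mean_score_per_bin(pose_values, scores, n_bins=10):
    """Quantile bin edges give roughly equal image counts per bin."""
    edges = np.quantile(pose_values, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(pose_values, edges[1:-1])       # bin index in [0, n_bins - 1]
    means = np.array([scores[idx == b].mean() for b in range(n_bins)])
    return edges, means
```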
Figure 7: (a–e) Distributions of quality scores for different levels of blur obtained using (a) Model-1, (b) Model-2, (c) face detection scores [16], (d) the norm of [19] and (e) BRISQUE. (f–j) Distributions of quality scores for different levels of occlusion obtained using (f) Model-1, (g) Model-2, (h) face detection scores [16], (i) the norm of [19] and (j) BRISQUE.

Figure 8: We perform face verification by pooling features with various quality scores on the IJB-C dataset.
Features used | Scores for quality pooling
L₂-constrained softmax loss [17] | FD Score
L₂-constrained softmax loss [17] | None (media average)
AlexNet [19] | Feature norm
AlexNet [19] | None (media average)
L₂-constrained softmax loss [17] | Model-1
AlexNet [19] | Model-2

Table 2: Combinations of features and quality pooling scores used in this experiment
Method / FPR (increasing →)
Facenet [20]: 20.95, 33.30, 48.69, 66.45, 81.76, 92.45
VGGFace [14]: 32.20, 43.69, 59.75, 74.79, 87.13, 95.64
Norm of [19]: 47.2, 61.9, 74.1, 84.2, 91.8, 97.3
FD Score on [17]: 66.2, 85.9, 91.9
Model-1 (Ours)

Table 3: True positive rates for different face verification methods on the IJB-C dataset.
Facial image quality can also be degraded by occlusion. To corroborate the universality of the Siamese MLP scores, we observe the variation of the score with respect to occlusion and compare it with the variation of the FD score and the feature norm. We use the training split of the WIDER dataset, as it provides ground-truth annotations for the occlusion in each image. Figure 7(f–j) presents the distributions for different levels of occlusion modeled by Model-1, Model-2, the norm of [19] and the face detection scores [16]. Once again, it is clear from Figures 7(h) and (j) that the FD score and BRISQUE do not correlate well with occlusion.
After comparing the correlation of the iconicity/quality metrics with various factors, we perform face verification by pooling features with various quality scores (including the scores of Model-1 and Model-2). The combinations of features and corresponding pooling weights are listed in Table 2. These experiments are performed on the IJB-C dataset, using the verification protocol specified in [11]. The verification protocol includes 19,557 genuine matches and 15,638,932 impostor matches, which allows us to evaluate performance at very low FARs. The algorithm used for quality pooling is the same as in [16]. Given a feature $f_i$ in a template of $L$ images and the corresponding quality score $r_i$ from a given model (Model-1 or Model-2), we compute

$$q_i = \frac{e^{\lambda r_i}}{\sum_{j=1}^{L} e^{\lambda r_j}}$$

where $\lambda$ is a hyperparameter, chosen empirically based on the verification performance on held-out data. Following this, we use $q_i$ to weight the feature $f_i$ in a given template and obtain the final feature descriptor as $f = \sum_{i=1}^{L} q_i f_i$, which is used for verification (a code sketch of this pooling follows at the end of this subsection). It is clear from Figure 8 that Model-1 and Model-2 perform better than media averaging of features and the norm of the features in [19], especially at low FARs. Model-1's performance is comparable to that of quality pooling the features of [17] with FD scores. Moreover, we observe that the features of [17] outperform the AlexNet features [19] in general. Hence, we pool the features of [17] using the scores of Model-1 (our best performing model), FD scores, and media averaging (our baseline), and compare their respective verification results in Table 3. Clearly, the iconicity scores from Model-1 outperform the norm of [19], media averaging of the features of [17], Facenet [20] and VGGFace [14].

We now discuss the verification results obtained with our iconicity scores and analyze the difference between the results obtained with Model-1 and Model-2. Our approach to training iconicity models depends on the facial quality information encoded in the feature. Quality-rich facial features would help to learn a better iconicity model. Hence, we perform a small experiment to compare the quality information present in the features of [17] and [19]. In this experiment, we compare the relative information about facial yaw in these features, since yaw is an important facial attribute that affects the verifiability (and hence the iconicity) of a face.

We randomly select 1,000 features each using [17] and [19] and divide them into training and testing splits (60% and 40%, respectively). For these data, we also compute the yaws using [18]. We then train a linear regression model on the training data to predict facial yaw. Finally, we compare the test errors of the linear regressors trained with the features of [17] and [19] to estimate the amount of yaw information contained in these features. The regression errors are provided in Table 4. Clearly, the error of the regression model trained with [19] is higher than that of the model trained with [17]. Hence, we believe that the features of [17] encapsulate much more yaw (and hence quality) information than those of [19]. This explains the superiority of the iconicity model trained with [17] over that trained with [19].
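As a concrete illustration of the quality pooling step above, here is a minimal NumPy sketch; the function name is ours, and λ is the hyperparameter from the text (its exact value is not recoverable here):

```python
import numpy as np

def pool_template(features, scores, lam):
    """features: (L, D) descriptors of one template; scores: (L,) iconicity r_i."""
    w = np.exp(lam * scores)
    q = w / w.sum()                    # q_i = exp(lam * r_i) / sum_j exp(lam * r_j)
    return (q[:, None] * features).sum(axis=0)   # f = sum_i q_i * f_i
```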
Features used for linear regression | Test error
L₂-constrained softmax loss [17] | 0.71
AlexNet [19] | 0.84

Table 4: Errors of linear regression models when predicting facial yaw
Method | Yaw | Pitch | Roll | Blur | Occlusion | Verification | Universal
BRISQUE [12] | ✗ | ✗ | ✗ | ✗ | ✗ | – | ✓
FD Score | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓
Norm of [19] | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗
Ours | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

Table 5: Performance breadth of different face quality metrics across various tasks: correlation with yaw (Y), pitch (P), roll (R), blur and occlusion; quality pooling for face verification; and universality
6. Performance breadth
For an ideal face quality metric, it is the performance breadth (rather than depth on certain tasks) and universality that demonstrate the efficacy of the metric. We find that BRISQUE (a general IQA metric) is outperformed by all other methods; given the difference between IQA and face quality prediction, this outcome is expected. We find that FD scores can be effective for template-based face verification, but they do not correlate well with blur, occlusion and roll (see Figures 6 and 7). Also, we find that the norm of the features correlates well with yaw, pitch and the other factors, but performs poorly when used for template-based face verification, especially at low FARs (Table 3). Moreover, the norm cannot be used as a face quality metric if the features have uniform norms; hence, it is not universal. Interestingly, our iconicity scores (Model-1/Model-2) correlate well with all the factors affecting face quality and also obtain verification results comparable to the state of the art. Thus, as the Siamese MLP scores demonstrate the maximum performance breadth and are universal, they are the closest to an ideal quality metric among the existing metrics. The breadth results are summarized in Table 5.
7. Discussion
In this work, we proposed a data-driven approach to learn the iconicity of an image feature without the use of a predefined set of quality-indicating images or any external resource to aid the training of the model. As iconicity implies quality, we observed the variation of our model's scores with respect to factors that affect the quality of an image. Finally, we used our scores to weight features for template-based face verification. Our scores outperform the media averaging technique and show improvement over scores obtained directly from a face detector.

Acknowledgements
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
References

[1] A. Abaza, M. A. Harrison, and T. Bourlai. Quality metrics for practical face recognition. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3103–3107. IEEE, 2012.
[2] A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa. UMDFaces: An annotated face dataset for training deep networks. In Biometrics (IJCB), 2017 IEEE International Joint Conference on, pages 464–473. IEEE, 2017.
[3] T. L. Berg and A. C. Berg. Finding iconic images. In Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2009 IEEE Computer Society Conference on, pages 1–8. IEEE, 2009.
[4] T. L. Berg and D. Forsyth. Automatic ranking of iconic images. University of California, Berkeley, Tech. Rep., 2007.
[5] L. Best-Rowden and A. K. Jain. Automatic face image quality prediction. arXiv preprint arXiv:1706.09887, 2017.
[6] J. Chen, Y. Deng, G. Bai, and G. Su. Face image quality assessment based on learning to rank. IEEE Signal Processing Letters, 22(1):90–94, 2015.
[7] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[8] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1733–1740, 2014.
[9] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.
[10] X. Liu, J. van de Weijer, and A. D. Bagdanov. RankIQA: Learning from rankings for no-reference image quality assessment. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1040–1049. IEEE, 2017.
[11] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. IARPA Janus Benchmark–C: Face dataset and protocol. In International Conference on Biometrics (ICB), 2018.
[12] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
[13] C. J. Parde, C. Castillo, M. Q. Hill, Y. I. Colon, S. Sankaranarayanan, J.-C. Chen, and A. J. O'Toole. Face and image representation in deep CNN features. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 673–680. IEEE, 2017.
[14] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[15] P. J. Phillips and J. R. Beveridge. An introduction to biometric-completeness: The equivalence of matching and quality. In Biometrics: Theory, Applications, and Systems (BTAS'09), 2009 IEEE 3rd International Conference on, pages 1–5. IEEE, 2009.
[16] R. Ranjan, A. Bansal, H. Xu, S. Sankaranarayanan, J.-C. Chen, C. D. Castillo, and R. Chellappa. Crystal loss and quality pooling for unconstrained face verification and recognition. arXiv preprint arXiv:1804.01159, 2018.
[17] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[18] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 17–24. IEEE, 2017.
[19] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on, pages 1–8. IEEE, 2016.
[20] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[21] H. Sellahewa and S. A. Jassim. Image-quality-based adaptive face recognition. IEEE Transactions on Instrumentation and Measurement, 59(4):805–813, 2010.
[22] H. Sheikh. LIVE image quality assessment database release 2. http://live.ece.utexas.edu/research/quality, 2005.
[23] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1041–1049. ACM, 2017.
[24] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[26] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[27] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.