Beyond Identity: What Information Is Stored in Biometric Face Templates?
Philipp Terhörst, Daniel Fährmann, Naser Damer, Florian Kirchbuchner, Arjan Kuijper
Fraunhofer Institute for Computer Graphics Research IGD, 64283 Darmstadt, Germany
Department of Computer Science, Technical University of Darmstadt, 64289 Darmstadt, Germany
Abstract
Deeply-learned face representations enable the success of current face recognition systems. Despite the ability of these representations to encode the identity of an individual, recent works have shown that more information is stored within, such as demographics, image characteristics, and social traits. This threatens the user's privacy, since for many applications these templates are expected to be used solely for recognition purposes. Knowing the information encoded in face templates helps to develop bias-mitigating and privacy-preserving face recognition technologies. This work aims to support the development of these two branches by analysing face templates with regard to 113 attributes. Experiments were conducted on two publicly available face embeddings. To evaluate the predictability of the attributes, we trained a massive attribute classifier that is additionally able to accurately state its prediction confidence. This allows us to make more fine-grained statements about attribute predictability. The results demonstrate that up to 74 attributes can be accurately predicted from face templates. Especially non-permanent attributes, such as age, hairstyles, hair colors, beards, and various accessories, were found to be easily predictable. Since face recognition systems aim to be robust against these variations, future research might build on this work to develop more understandable privacy-preserving solutions and to build robust and fair face templates.
1. Introduction
The advances of deep neural representations have led to high-performing face recognition solutions [8]. Due to the achieved performance, face recognition systems have spread world-wide and increasingly affect our daily life [4]. Although these face representations are trained to enable the recognition of individuals, previous works have shown that more information than just the identity is embedded. They demonstrated that face templates contain information about head pose [30], image characteristics (such as quality [1, 11], viewpoint [12], and illumination [28]), demographics [5, 37, 29], and social traits [31]. However, for many applications, the users do not grant access to this information. Thus, the stored data should be used exclusively for recognition purposes [25], and extracting such information without a person's consent is considered a violation of their privacy [17]. This problem is known as soft-biometric privacy [25], and solutions are built either on image-level [27, 23, 24] or template-level [34, 35, 40, 2].

Since knowledge about the attributes encoded in face templates is required to develop more advanced bias-mitigating solutions [7, 21, 41, 38, 42] and more comprehensive privacy-enhancing technologies, in this work we investigate the predictability of 113 attributes from face templates at different difficulty levels. We jointly trained a massive attribute classifier (MAC) on a high number of attributes to take advantage of a shared feature space. The MAC is modified such that it is able to accurately state its prediction reliability [37]. This allows us to make predictions at two reliability levels and thus to derive more fine-grained statements about the predictability of attributes in face templates. The experiments were conducted on two publicly available databases, CelebA [22] and LFW [13], and on two popular face embeddings, FaceNet [32] and ArcFace [6].
To derive understandable statements about the stored attribute information, we categorized each attribute into one of three predictability classes: easily-predictable, predictable, and hardly-predictable. The results show that 39 attributes are assigned to the easily-predictable class and that 74 of the 113 investigated attributes are at least predictable. Although face templates are learned to be robust to non-permanent factors, the results demonstrate that especially these attributes are easily predictable. This includes information about age, hairstyles, hair colors, beards, and accessories, such as makeup, lipstick, and glasses.
2. Related work
The development of deep neural network representations for faces led to strong performance boosts for face recognition [8]. However, since these representations are derived from black-box models, it is not clear which kind of information is stored in them.

In 2017, Parde et al. [30] investigated face representations in terms of head position and the source of the image. The results demonstrated that the investigated representations contain accurate information about the yaw and pitch of a face and about whether the input face originates from a still image or a video frame. They suggested that image-quality information might be available in these features as well. This hypothesis was proven to be correct [39, 1, 11]: in [39, 1, 11], face image quality was successfully predicted based on face embeddings.

In [31], Parde et al. analysed whether face representations retain information in faces that supports social-trait inferences. In their experiments, they investigated 11 social traits such as talkative, assertive, shy, quiet, warm, artistic, efficient, careless, impulsive, anxious, and lazy. They trained linear classifiers to predict these human-assigned social-trait profiles and demonstrated that these traits can be determined from face embeddings to a high degree. The best-predicted traits were impulsive, warm, and anxious.

Hill et al. [12] analysed the representations of caricature faces. They examined the organization of viewpoint (0, 20, 30, 45, 60), illumination (ambient vs. spotlight), gender (male vs. female), and identity in the embedding space. Their results showed that the utilized face recognition model creates a highly organized, hierarchical similarity structure in which information about face identity and imaging characteristics coexist. These results were summarized by O'Toole et al. [28].
They reviewed which properties of the face space are known and grounded them in the context of previous-generation face recognition algorithms.

In [43, 44], Zhong et al. demonstrated that the use of various mid-level representations from face recognition networks leads to highly accurate facial attribute estimation performance. This indicates that high-level representations, such as face recognition templates, might also contain a significant amount of facial attribute information. In [5, 37, 3, 29], it is shown that demographic attributes such as gender, age, and race can be derived from face templates.

In summary, previous work showed that head pose, image characteristics (such as quality, source of the image, viewpoint, and illumination), demographic attributes (gender, age, race), and social traits (e.g. impulsive, warm, and anxious) can be found in face templates.

In contrast to previous work that investigated only specific characteristics, in this work we analyse a wide range of attributes (up to 113) in face representations. Moreover, we analyse the predictability of these attributes under different levels of prediction reliability. This allows us to state more generally which attributes are encoded in face templates.
3. Investigation methodology
This work aims at analysing the set of soft-biometric information that is stored in face templates. To do so, we train a classifier to jointly predict these attributes. If the classifier can successfully predict them, we conclude that these attributes are stored in the face templates. However, this only allows us to answer the question of what information is embedded. A statement about what information is not included is not possible, because the reverse conclusion is not necessarily logical: if an estimator is not able to learn the pattern of an attribute, it does not imply that the pattern does not exist. The classifier might simply not be able to deal with the complexity of the attribute pattern, or the amount and representation of the data might be insufficient.

To answer the research question of this work, the following three subsections explain the different steps of the investigation methodology. In Section 3.1, we first explain the classifier training procedure that allows a joint prediction of a large number of attributes. Learning these attributes in a multi-task learning approach enhances the performance, since many attributes share similar features. In Section 3.2, we explain how this classifier can accurately state its prediction confidence. This prediction confidence determines the quality of a prediction and enables us to derive predictability classes in Section 3.3. These predictability classes allow us to generalize our findings into easily understandable statements.
To investigate what attribute information is stored in face templates, we train a classifier model to predict multiple attributes. If the classifier can correctly predict these attributes given face templates, we can draw conclusions about what attributes are encoded in the investigated representation.

Therefore, we trained a neural network model to jointly predict multiple attributes given the face templates of the training set. Due to the large number of predicted attributes, we refer to this model as the massive attribute classifier (MAC). To find an optimal network structure for our MAC, we evaluated multiple models with varying numbers of dense layers and layer sizes. To be precise, we evaluated random network structures with 1-3 initial layers and 1-3 branch layers that connect the last initial layer with the softmax layers of each attribute. For each layer, sizes of 128, 256, and 512 were evaluated. We chose the structure with the most stable results as the layout of our MAC. However, despite the large variations in the investigated network structures, we observed that, in most cases, the prediction performance per attribute only varies within a range of 1-2%.

The chosen MAC network consists of two initial layers, the input layer of size n_in and a second dense layer of size 512. Here, n_in refers to the size of the utilized face embedding. Starting from the second layer, each attribute a has its own branch consisting of two additional layers of size 512 and n_out(a), where n_out(a) refers to the number of classes of that attribute. Each layer has a ReLU activation, except for the output layers, which have softmax activations. Moreover, batch normalization [14] and dropout [33] with a dropout probability p_drop are applied to every layer. The dropout helps the model to generalize, but also enables us to derive reliability statements about the predictions (described in Section 3.2).
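As a structural illustration, the branched layout described above can be written down as plain data. This is only a sketch: the function `mac_layout` and its argument names are hypothetical, and it records layer sizes rather than constructing an actual trained network.

```python
def mac_layout(n_in, attribute_classes, width=512):
    """Layer sizes of the MAC as described in the text: two shared
    layers (input of size n_in, dense layer of size `width`), then
    per attribute a branch with one dense layer of size `width` and
    a softmax output sized to that attribute's number of classes."""
    shared = [n_in, width]
    branches = {attr: [width, n_out]
                for attr, n_out in attribute_classes.items()}
    return shared, branches

# Example: FaceNet-sized embeddings (128) with two binary attributes.
shared, branches = mac_layout(128, {"Male": 2, "Eyeglasses": 2})
```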
The training of the MAC was done in a multi-task learning fashion by applying a categorical cross-entropy loss for each attribute branch and using an equal weighting between these attribute-related losses. For the training, an Adam optimizer [18] was used with e = 200 epochs, an initial learning rate α, and a learning-rate decay of β = α/e. These parameter choices are guided by [37]. The batch size b was chosen according to the amount of available data: b = 1024 for CelebA and b = 16 for LFW.

To derive statements about the predictability of an attribute in a face template, we use prediction reliabilities to simulate close-to-optimal classifier circumstances. Therefore, we follow the methodology in [37, 36] to enable our MAC to state its prediction confidence (reliability). Following this approach, we trained the MAC with dropout. To derive a reliability statement in addition to an attribute prediction, m = 100 stochastic forward passes are performed. In each forward pass, a different dropout pattern is applied, resulting in m different softmax outputs v_i^(a) for each attribute a. Given the outputs of the m stochastic forward passes for the predicted class ĉ, denoted as x_i^(a) = v_{i,ĉ}^(a), the reliability measure is given as

    rel(x^(a)) = (1-α)/m · Σ_{i=1}^{m} x_i^(a)  −  α/m² · Σ_{i=1}^{m} Σ_{j=1}^{m} |x_i^(a) − x_j^(a)|,

where the weighting α is chosen following the recommendation in [37]. The first part of the equation is a measure of centrality and utilizes the probability interpretation of the softmax output: a higher value can be interpreted as a high probability that the prediction is correct. The second part of the equation is a measure of dispersion and quantifies the agreement of the stochastic outputs x. In [37], this was shown to be an accurate reliability measure.

We use this reliability measure to simulate more idealistic circumstances. For each attribute, we calculate the prediction and corresponding reliability of each instance.
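The reliability measure above can be sketched directly, assuming the m per-pass softmax scores of the predicted class have already been collected from the stochastic forward passes. The default weighting `alpha=0.5` here is an illustrative assumption, not the value recommended in [37].

```python
def reliability(x, alpha=0.5):
    """Reliability of one attribute prediction from m stochastic
    forward passes (MC dropout): a centrality term (mean softmax
    score of the predicted class) minus a dispersion term (mean
    pairwise disagreement of the stochastic outputs)."""
    m = len(x)
    centrality = (1.0 - alpha) / m * sum(x)
    dispersion = alpha / m**2 * sum(abs(xi - xj) for xi in x for xj in x)
    return centrality - dispersion
```

Stable, confident outputs (high scores, low spread) yield high reliability; scattered outputs are penalized by the dispersion term.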
Then we take the predictions with the 100% and 50% highest reliabilities to evaluate the performance. These percentages refer to the ratio of considered predictions (RCP) of 100% and 50%. The performance at 100% RCP refers to the general performance on the whole dataset. The performance at 50% RCP refers to the performance on the predictions with the 50% highest reliabilities. Consequently, this refers to the performance based on the predictions about which the MAC is most confident. The unconsidered 50% of the predictions might contain factors of variance (such as blur or non-frontal head poses) that lead to unstable, and thus inaccurate, attribute estimates.

To derive more understandable statements about which attribute information is stored in a face template, we categorize each attribute into one of three predictability classes:

• Easily-predictable (++): an attribute is categorized as easily-predictable if, and only if, the balanced accuracy at 100% RCP is above 90%. This means that highly accurate predictions are possible even under non-ideal circumstances such as bad illumination and non-frontal head poses.

• Predictable (+): an attribute is categorized as predictable if, and only if, the balanced accuracy at 100% RCP is under 90%, but the balanced accuracy at 50% RCP is above 90%. This indicates that highly accurate predictions are possible under close-to-optimal conditions, since only the 50% most confident MAC predictions are taken into account.

• Hardly-predictable (0): an attribute is categorized as hardly-predictable if the balanced accuracy is below 90% at both 100% and 50% RCP. Even under close-to-optimal circumstances, the MAC is not able to reach high accuracies. Consequently, the attribute pattern might be too complex for the MAC to handle, or no meaningful pattern exists for this attribute.

While the first two categories (easily-predictable and predictable) allow making confident statements about the amount of attribute information in face templates, the same does not apply to the third category (hardly-predictable). The last category only states that the classifier is not able to accurately learn the pattern, which might be due to several reasons: (1) the pattern does not exist, (2) the pattern exists, but it is too complex for the model to learn, or (3) the pattern exists, but the amount of data and its representation are not appropriate for the classifier to learn it. Consequently, in the third case, we cannot determine whether the attribute pattern exists.
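The two-level evaluation above can be sketched as follows, assuming per-instance labels, predictions, and reliabilities for one attribute are available. The function names are illustrative, and balanced accuracy is implemented as the mean per-class recall.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, robust to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def predictability_class(y_true, y_pred, rel):
    """Assign ++ / + / 0 from balanced accuracies at 100% and 50% RCP."""
    acc_100 = balanced_accuracy(y_true, y_pred)
    # 50% RCP: keep only the half of the predictions with the
    # highest reliabilities.
    order = sorted(range(len(rel)), key=lambda i: rel[i], reverse=True)
    keep = order[: len(order) // 2]
    acc_50 = balanced_accuracy([y_true[i] for i in keep],
                               [y_pred[i] for i in keep])
    if acc_100 > 0.9:
        return "++"   # easily-predictable
    if acc_50 > 0.9:
        return "+"    # predictable
    return "0"        # hardly-predictable
```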
4. Experimental setup
For the analysis of the face space, we chose the Labeled Faces in the Wild (LFW) [13] and the CelebFaces Attributes (CelebA) [22] datasets because of their large and rich attribute annotations. The large number of different soft-biometric labels allows us to deeply investigate which of these attributes are encoded in face templates. Figure 1 shows sample images from both datasets.

Figure 1: Sample images from CelebA (top row) and LFW (bottom row).

The CelebA dataset [22] is a large-scale dataset with more than 200k images of over 10k celebrities. It covers large variations in pose and background. Moreover, each image is labelled with 40 binary attributes. LFW [13] contains over 13k images from over 5k individuals and exhibits variability in pose, lighting, focus, resolution, facial expression, age, gender, race, accessories, make-up, occlusion, background, and photographic quality. The face images are 250x250 pixels and mostly in color. Each image is annotated with up to 73 attributes. The attribute labels of both databases [13, 22] cover a wide range of characteristics, such as the person's demographics, skin, hair, beard, face geometry, periocular area, mouth, nose, accessories, and environment.
In contrast to CelebA, where the attribute labels are of binary nature, in LFW the labels come from the prediction probabilities of a binary classifier [13]. Each label value measures the degree of the attribute and is thus continuous [19, 20]. E.g., for the attribute male, a higher label score indicates that the person appears more masculine than a person with a lower label score. Consequently, the top-ranked images for an attribute represent the label true, while the lowest-ranked images indicate the label false. A value around zero means that the corresponding attribute has little meaning for this image.

To make sure that our MAC performs well when training on LFW, we manually converted the continuous attribute labels to binary labels. Therefore, we assigned an upper and a lower score threshold to each attribute. Images with a score over the upper threshold are labelled true, images with a score under the lower threshold are labelled false, and images with scores within the range are labelled undefined. The upper and lower thresholds of an attribute are determined manually by moving potential thresholds away from zero. At each potential threshold, the ten images with the closest attribute scores are inspected, and the original LFW labels of these images are manually checked for correctness. If only eight or fewer labels are found to be correct, the potential threshold is moved further away from the starting point and the procedure is repeated. If a potential threshold returns images with nine or more correct labels, it is chosen as the limit. Repeating this over all attributes results in a lower and an upper threshold for each of these attributes. By binarizing the scores with these upper and lower thresholds, we ensure an error-minimizing data basis for the MAC. This allows us to train and test on meaningful and correctly labelled data. Please note that the label-cleaning process reduces the amount of used labels by 51.7%, which might induce a bias in our evaluation.
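The thresholding step above can be sketched as follows. The function name and the example thresholds are illustrative, since the actual per-attribute thresholds were determined manually as described in the text.

```python
def binarize_scores(scores, lower, upper):
    """Convert continuous LFW attribute scores into binary labels.
    Scores above `upper` become True, scores below `lower` become
    False, and scores in between are marked undefined (None) and
    excluded from training and testing."""
    labels = []
    for s in scores:
        if s > upper:
            labels.append(True)
        elif s < lower:
            labels.append(False)
        else:
            labels.append(None)
    return labels

# Example with hypothetical thresholds for one attribute; the band
# around zero is treated as undefined.
labels = binarize_scores([1.7, -2.3, 0.1], lower=-0.8, upper=0.9)
```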
To avoid biased conclusions that might result from this process, we additionally evaluate on another, binary-labelled database. After the label-cleaning, we found 15 attribute labels with a low number of samples in either the positive or the negative class; these are listed in Table 1 and are considered insufficient for a meaningful attribute analysis.

Table 1: Train/test sample distribution on LFW for selected attributes that are found insufficient for a meaningful attribute analysis after label-cleaning. Pos and Neg refer to the number of positively and negatively labelled samples for the train and test set. The listed 15 attributes are found to be insignificant for the analysis due to a low number of samples in either the positive or negative class.

                         Train           Test
Attribute                Pos    Neg      Pos    Neg
Color Photo              8806   29       3772   24
Mouth Slightly Open      674    109      315    57
Round Face               9      588      3      250
Goatee                   20     3346     10     1557
Baby                     23     9137     15     3913
Bangs                    89     5238     44     2080
Bald                     114    4413     47     1953
Big Lips                 101    751      48     318
Sunglasses               74     8583     50     3631
Partially Visible F.     124    1501     55     601
Mouth Wide Open          107    6593     56     2925
Double Chin              154    172      57     136
Harsh Lighting           113    914      62     487
Outdoor                  173    510      63     243
Teeth Not Visible        125    2209     66     1089

In this work, we derive what information is contained in face templates based on prediction accuracies. In machine learning, accuracy is defined as the ratio of the number of correct predictions to the total number of predictions [26]. To be robust to attribute imbalances, we report the prediction performance in terms of balanced accuracy. This refers to the standard accuracy with class-balanced sample weights [16].

The train/test data is defined by dividing the databases in a 70%/30% subject-exclusive split. To analyse the prediction performance of an estimator under more ideal circumstances, we chose a classifier for the attribute prediction task that is additionally able to accurately state its prediction confidence. For each face template, this classifier predicts the associated attributes and their prediction reliabilities. To get the prediction performance under more ideal circumstances, for each attribute, only the predictions with the 50% highest reliabilities are considered for the balanced accuracy. This balanced accuracy refers to a ratio of considered predictions (RCP) of 50%. Since this relates to the MAC prediction confidence, the balanced accuracy should be higher at lower RCP levels.
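As a small illustration of the balanced-accuracy definition above (standard accuracy with class-balanced sample weights), a minimal sketch assuming weights inversely proportional to class frequency:

```python
from collections import Counter

def class_balanced_weights(y_true):
    """Weight each sample inversely to the frequency of its class."""
    counts = Counter(y_true)
    return [1.0 / counts[t] for t in y_true]

def weighted_accuracy(y_true, y_pred, weights):
    """Standard accuracy with per-sample weights."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)

# With class-balanced weights, this equals the mean per-class recall:
# class 1 has recall 2/3, class 0 has recall 1.0, so the result is 5/6.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
acc = weighted_accuracy(y_true, y_pred, class_balanced_weights(y_true))
```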
In this work, we utilize two widely-used face recognition models, FaceNet [32] and ArcFace [6]. For both networks, we use publicly available pre-trained models trained on the MS1M database [10] (FaceNet: https://github.com/davidsandberg/facenet, ArcFace: https://github.com/deepinsight/insightface). To get the face template for a given face image, the image has to be aligned, scaled, and cropped. For FaceNet, the preprocessing is done as described in [15]; for ArcFace, we follow the preprocessing described in [9]. The preprocessed image is passed to the face recognition model to extract the embedding. The output size is 128 for FaceNet and 512 for ArcFace.

This work aims at understanding what kind of soft-biometric information is stored in face templates. Therefore, our investigations are divided into three parts:

1. We validate the attribute labels of both datasets by studying the correlations between the attributes.

2. We analyse which attributes are contained in face representations by investigating the attribute prediction performance on both datasets and face embeddings. To get a more complete perspective on the problem, the prediction performance at different confidence levels of the classifier is investigated.

3. We obtain an overview of which kind of information is encoded in face templates by categorizing each attribute into one of three predictability classes based on their two-level prediction performance.
5. Results
This section is divided into three subsections, each focusing on one investigation point: (1) analysis of the attribute correlations, (2) investigation of the attribute predictability, and (3) a summary of the findings.
To understand the quality of the labels and potential biases in the attribute labels, Figure 2 shows a selection of attribute-label correlations. The attributes are chosen to show the 15 most positive and negative pairwise correlations. For CelebA, the correlations in Figure 2a show that the large majority of male faces in the database do not wear lipstick, earrings, or makeup. These attributes mostly belong to female faces. Moreover, the figure shows some biases in the database labels. The majority of male faces have a beard. If a face is labelled as attractive, it most likely belongs to a young female face wearing accessories and makeup. However, this figure also confirms the quality of some labels. E.g., No Beard negatively correlates with all kinds of beards, such as Sideburns, Goatee, and Mustache.

Figure 2b shows the attribute correlations for LFW. It shows that the attributes Heavy Makeup, Wearing Lipstick, Wearing Earrings, and Wearing Necklace belong together with Youth and Attractive Woman, Smiling, and High Cheekbones. Moreover, this set of attributes does not correlate with a Receding Hairline and Male. Nevertheless, it also confirms the quality of other labels, such as No Eyewear (negatively correlates with Eyeglasses) and Curly Hair (negatively correlates with Straight Hair).

Figure 2: Label correlations for (a) CelebA and (b) LFW. The attributes are chosen to show the 15 most positive and negative pairwise correlations. The attribute correlations for LFW are shown after the label-cleaning process. Green indicates positive correlations, while red indicates negative correlations. The correlation is based on the Pearson coefficient.

To derive statements about which attributes are encoded in face templates, the prediction performance of the attributes is determined at two difficulty levels. 100% RCP (hard) refers to the use of all samples under the given circumstances. 50% RCP (easy) refers to the 50% of the predictions about whose correctness the classifier is most confident. In Table 2, the prediction performance is shown for CelebA, including the assigned predictability classes. Two general observations are made. First, the performance at the 50% RCP level is always higher than at 100% RCP, showing that the MAC learned reliable predictions on the dataset. Second, even though the prediction performance on FaceNet (FN) and ArcFace (AF) is very similar, the performance on FN is always slightly higher. This can be explained by ArcFace's margin principle during training, which distorts the feature space more incoherently and thus makes pattern learning harder. In total, many CelebA attributes can be predicted with high accuracy from face templates. This includes demographic characteristics such as gender, characteristics of the person's hairstyle, hair color, and beard. Moreover, the deeply encoded features also contain highly detailed information about the person's accessories.

Table 3 shows the same evaluation setting on the LFW database. The grey highlights refer to results with limited significance, since the label-cleaning process eliminated many samples with low-quality labels. The low number of training and testing samples explains some of the weak performance, such as for Baby, Sunglasses, and the Mouth category. However, comparing the results of LFW with the results of CelebA (Table 2) shows similar performance on attributes that occur in both datasets, such as demographic attributes, hair colors, face geometry, etc. Consequently, our label-cleaning process removed low-quality attribute labels but did not introduce a large bias into the data. Due to the entangled patterns encoded in the templates, some attributes, such as Bald, Bangs, and Goatee, are easy to learn and thus achieve high performance. Generally, the prediction performance using ArcFace embeddings is significantly weaker than using FaceNet. ArcFace embeddings contain more complex attribute patterns, and for the experiments on LFW less data was available for training, since we manually filtered low-quality labels. Consequently, it can be expected that with more training data the performance on ArcFace would be higher. Nevertheless, similar to CelebA, many attributes can be predicted with high accuracy from the templates alone. This holds for demographic attributes such as gender, age, and race, as well as for hairstyle, hair color, beard, and accessories. Moreover, characteristics of the face geometry, such as face shape, double chin, and forehead visibility, can be determined. Factors that do not belong to the person, such as lighting conditions and blurriness, cannot be predicted reliably with the MAC. It is interesting to note that the high predictability of Attractive Woman can be explained by its high correlation with accessories.
From the 113 investigated attributes, we found that 39 attributes are easily-predictable, 35 are predictable, and 39 are hardly-predictable. To obtain a more general overview of the encoded information in face templates, Table 4 summarizes the categories of the attributes in the three predictability classes. The assignment of the categories to the individual attributes is shown in Table 3. To provide a more complete view of the problem, this table also includes findings from related works. Since face templates are trained for the purpose of recognition, it seems logical that categories such as Face Geometry, Periocular Area, Nose, and Mouth would be easily-predictable. Surprisingly, this is not the case. Instead, non-permanent factors such as Hairstyle, Haircolor, Beard, Accessories, Head Pose, and Social Traits are easily-predictable. Modern face recognition systems aim to be robust against these factors, and still these factors are strongly present in face templates.

For many applications, the user of a face recognition system solely provides his biometric data for recognition. To prevent a function creep of this data, face templates should contain only identity-related information. However, the experiments showed that many privacy-sensitive attributes are encoded in face templates. This poses a major privacy risk. Consequently, future works might analyse the reason for these rich encodings and find solutions to preserve privacy in face recognition systems.

Table 4: Categorized summary of the predictability classes including findings of related works.

Easily-predictable    Predictable         Hardly-predictable
Demographics          Face Geometry       Skin
Hairstyle             Periocular          Mouth
Haircolor             Nose                Environment
Beard                 Image Quality [1]
Accessories
Head Pose [30]
Social Traits [31]
6. Conclusion
The success of current face recognition systems is based on the advances of deeply-learned templates. Recent works have shown that demographics, image characteristics, and social traits are encoded in these templates. This can lead to biased decisions in face recognition systems and raises major privacy issues. In many applications, these templates are expected to be used for recognition purposes only, and deducing information that is not required for recognition is considered a violation of the users' privacy. Knowledge of the information encoded in face templates is necessary to develop effective bias-mitigating and privacy-preserving technologies. The main contribution of this work is an analysis of what information is stored in face templates. More precisely, 113 attributes are analysed towards their predictability from face templates. The experiments were conducted on two popular face templates under two difficulty levels. To facilitate the understandability of the results, each attribute was further categorized into one of three predictability classes. The results reveal that about one third of the analysed attributes are easily-predictable, another third is predictable, and one third is hardly-predictable. Although face recognition templates are trained to be robust against non-permanent factors, the results demonstrate that especially these attributes are accurately predictable from face templates. Future works might build on the knowledge of this work to develop comprehensive bias-mitigating and privacy-preserving solutions for face recognition.

Table 2: Prediction performance on CelebA: the performance is based on FaceNet (FN) and ArcFace (AF) embeddings and is reported in terms of balanced accuracies at two difficulty scenarios: 100% RCP (hard) and 50% RCP (easy). ++, +, and 0 state the assigned predictability class.

Table 3 (LFW): ++, +, and 0 state the assigned predictability class. Grey highlighting refers to reduced expressiveness due to limited data after the label-cleaning process.

Acknowledgement

This work was supported by the German Federal Ministry of Education and Research (BMBF) as well as by the Hessen State Ministry for Higher Education, Research and the Arts (HMWK) within the National Research Center for Applied Cybersecurity (ATHENE), and in part by the German Federal Ministry of Education and Research (BMBF) through the Software Campus project.
References

[1] L. Best-Rowden and A. K. Jain. Learning face image quality from human assessments. IEEE Transactions on Information Forensics and Security, 13(12):3064–3077, Dec 2018.
[2] B. Bortolato, M. Ivanovska, P. Rot, J. Krizaj, P. Terhörst, N. Damer, P. Peer, and V. Struc. Learning privacy-enhancing face representations through feature disentanglement. IEEE, 2020.
[3] F. Boutros, N. Damer, P. Terhörst, F. Kirchbuchner, and A. Kuijper. Exploring the channels of multiple color spaces for age and gender estimation from face images. Pages 1–8. IEEE, 2019.
[4] N. Damer, Y. Wainakh, V. Boller, S. von den Berken, P. Terhörst, A. Braun, and A. Kuijper. CrazyFaces: Unassisted circumvention of watchlist face identification. Pages 1–9. IEEE, 2018.
[5] A. Das, A. Dantcheva, and F. Bremond. Mitigating bias in gender, age and ethnicity classification: A multi-task convolution neural network approach. In L. Leal-Taixé and S. Roth, editors, Computer Vision - ECCV 2018 Workshops, Munich, Germany, September 8-14, 2018, Proceedings, Part I, volume 11129 of Lecture Notes in Computer Science, pages 573–585. Springer, 2018.
[6] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[7] S. Gong, X. Liu, and A. K. Jain. DebFace: De-biasing face recognition. CoRR, abs/1911.08080, 2019.
[8] P. Grother, M. Ngan, and K. Hanaoka. Ongoing face recognition vendor test (FRVT) part 2: Identification. NIST Interagency/Internal Report (NISTIR), 2018.
[9] J. Guo, J. Deng, N. Xue, and S. Zafeiriou. Stacked dense u-nets with dual transformers for robust face alignment. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, page 44. BMVA Press, 2018.
[10] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, volume 9907 of Lecture Notes in Computer Science, pages 87–102. Springer, 2016.
[11] J. Hernandez-Ortega, J. Galbally, J. Fiérrez, R. Haraksim, and L. Beslay. FaceQnet: Quality assessment for face recognition based on deep learning. In IEEE International Conference on Biometrics, ICB 2019, Crete, Greece, June 4-7, 2019, Jun. 2019.
[12] M. Q. Hill, C. J. Parde, C. D. Castillo, Y. I. Colon, R. Ranjan, J. Chen, V. Blanz, and A. J. O'Toole. Deep convolutional neural networks in the face of caricature: Identity and image revealed. CoRR, abs/1812.10902, 2018.
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015.
[15] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, pages 1867–1874. IEEE Computer Society, 2014.
[16] J. D. Kelleher, B. M. Namee, and A. D'Arcy. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press, 2015.
[17] E. J. Kindt. Biometric Data, Data Protection and the Right to Privacy. Springer Netherlands, Dordrecht, 2013.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, 2015.
[19] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pages 365–372. IEEE Computer Society, 2009.
[20] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Describable visual attributes for face verification and image search. IEEE Trans. Pattern Anal. Mach. Intell., 33(10):1962–1977, 2011.
[21] J. Liang, Y. Cao, C. Zhang, S. Chang, K. Bai, and Z. Xu. Additive adversarial learning for unbiased authentication. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[23] V. Mirjalili, S. Raschka, and A. Ross. FlowSAN: Privacy-enhancing semi-adversarial networks to confound arbitrary face-based gender classifiers. IEEE Access, 7:99735–99745, 2019.
[24] V. Mirjalili, S. Raschka, and A. Ross. PrivacyNet: Semi-adversarial networks for multi-attribute face privacy, 2020.
[25] V. Mirjalili and A. Ross. Soft biometric privacy: Retaining biometric utility of face images while perturbing gender. Pages 564–573. IEEE, 2017.
[26] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, Mass., 2013.
[27] A. A. Othman and A. Ross. Privacy of facial soft biometrics: Suppressing gender but retaining identity. In L. Agapito, M. M. Bronstein, and C. Rother, editors, Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part II, volume 8926 of Lecture Notes in Computer Science, pages 682–696. Springer, 2014.
[28] A. J. O'Toole, C. D. Castillo, C. J. Parde, M. Q. Hill, and R. Chellappa. Face space representations in deep convolutional neural networks. Trends in Cognitive Sciences, 22(9):794–809, 2018.
[29] G. Özbulak, Y. Aytar, and H. K. Ekenel. How transferable are CNN-based features for age and gender classification? In A. Brömme, C. Busch, C. Rathgeb, and A. Uhl, editors, volume P-260 of LNI, pages 39–50. GI / IEEE, 2016.
[30] C. J. Parde, C. D. Castillo, M. Q. Hill, Y. I. Colon, S. Sankaranarayanan, J. Chen, and A. J. O'Toole. Face and image representation in deep CNN features. Pages 673–680. IEEE Computer Society, 2017.
[31] C. J. Parde, Y. Hu, C. D. Castillo, S. Sankaranarayanan, and A. J. O'Toole. Social trait information in deep convolutional neural networks trained for face identification. Cognitive Science, 43(6), 2019.
[32] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 815–823. IEEE Computer Society, 2015.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.
[34] P. Terhörst, N. Damer, F. Kirchbuchner, and A. Kuijper. Suppressing gender and age in face templates using incremental variable elimination. IEEE, 2019.
[35] P. Terhörst, M. Huber, N. Damer, F. Kirchbuchner, and A. Kuijper. Unsupervised enhancement of soft-biometric privacy with negative face recognition. CoRR, abs/2002.09181, 2020.
[36] P. Terhörst, M. Huber, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. Multi-algorithmic fusion for reliable age and gender estimation from face images. Pages 1–8. IEEE, 2019.
[37] P. Terhörst, M. Huber, J. N. Kolf, I. Zelch, N. Damer, F. Kirchbuchner, and A. Kuijper. Reliable age and gender estimation from face images: Stating the confidence of model predictions. IEEE, 2019.
[38] P. Terhörst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. Post-comparison mitigation of demographic bias in face recognition using fair score normalization. CoRR, abs/2002.03592, 2020.
[39] P. Terhörst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. SER-FIQ: Unsupervised estimation of face image quality based on stochastic embedding robustness, 2020.
[40] P. Terhörst, K. Riehl, N. Damer, P. Rot, B. Bortolato, F. Kirchbuchner, V. Struc, and A. Kuijper. PE-MIU: A training-free privacy-enhancing face recognition approach based on minimum information units. IEEE Access, 8:93635–93647, 2020.
[41] P. Terhörst, M. L. Tran, N. Damer, F. Kirchbuchner, and A. Kuijper. Comparison-level mitigation of ethnic bias in face recognition. Pages 1–6. IEEE, 2020.
[42] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Feature transfer learning for face recognition with under-represented data. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 5704–5713. Computer Vision Foundation / IEEE, 2019.
[43] Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In International Conference on Biometrics, ICB 2016, Halmstad, Sweden, June 13-16, 2016, pages 1–7. IEEE, 2016.
[44] Y. Zhong, J. Sullivan, and H. Li. Leveraging mid-level deep representations for predicting face attributes in the wild. In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016.