Mitigating Face Recognition Bias via Group Adaptive Classifier
Sixue Gong Xiaoming Liu Anil K. Jain
Michigan State University, East Lansing MI 48824 {gongsixu, liuxm, jain}@msu.edu
Abstract
Face recognition is known to exhibit bias: subjects in certain demographic groups can be better recognized than others. This work aims to learn a fair face representation, where faces of every group could be equally well-represented. Our proposed group adaptive classifier, GAC, learns to mitigate bias by using adaptive convolution kernels and attention mechanisms on faces based on their demographic attributes. The adaptive module comprises kernel masks and channel-wise attention maps for each demographic group so as to activate different facial regions for identification, leading to more discriminative features pertinent to their demographics. We also introduce an automated adaptation strategy which determines whether to apply adaptation to a certain layer by iteratively computing the dissimilarity among demographic-adaptive parameters, thereby increasing the efficiency of the adaptation learning. Experiments on benchmark face datasets (RFW, LFW, IJB-A, and IJB-C) show that our framework is able to mitigate face recognition bias in various demographic groups as well as maintain competitive performance.
Face recognition (FR) systems are known to exhibit discriminatory behaviors against certain demographic groups [27, 37, 20]. The
NIST Face Recognition Vendor Test [20] shows that all
FR algorithms that participated in the test exhibit varying biased performance on gender, race, and age groups of a mugshot dataset. Deploying biased FR systems for law enforcement is potentially unethical [11]. Given the importance of automated FR-driven decisions, it is crucial to develop fair and unbiased FR systems to avoid negative societal impact. Note that we define FR bias as the uneven recognition performance with respect to demographic groups, which differs from the inductive bias in machine learning [14].
State-of-the-art (SOTA) FR algorithms [45, 65, 12] rely on convolutional neural networks (CNNs) trained on large-scale face datasets. The public training datasets for FR, e.g., CASIA-WebFace [74], VGGFace2 [5], and MS-Celeb-1M [22], are collected by scraping face images off the web, with inevitable demographic bias [66]. Biases in data are transmitted to FR models through network learning. For example, to minimize the overall loss, a network tends to learn a better representation for faces in the majority group, whose number of faces dominates the training set, resulting in unequal discriminability. The imbalanced distribution of demographics in face data is, nevertheless, not the only trigger of FR bias. Prior works have shown that even when using a demographically balanced dataset [66] or training separate classifiers for each group [37], the performance on some groups is still inferior to the others. By studying non-trainable FR algorithms, [37] introduced the notion of inherent bias, i.e., certain groups are inherently more susceptible to errors in the face matching process.
To tackle the dataset-induced bias, traditional methods re-weight either the data proportions [6] or cost values [1]. Such methods are limited when applied to large-scale imbalanced datasets. Recent imbalance learning methods focus on novel objective functions for class-skewed datasets.

Preprint. Under review.

For
instance, Dong et al. [17] propose a Class Rectification Loss to incrementally optimize on hard samples of the classes with under-represented attributes. Alternatively, researchers strengthen the decision boundary to impede perturbation from other classes by enforcing margins between hard clusters via adaptive clustering [30], or between rare classes via Bayesian uncertainty estimates [35]. To adapt the aforementioned methods to racial bias mitigation, Wang et al. [66] modify large-margin-based loss functions via reinforcement learning. However, [66] requires two auxiliary networks, an offline sampling network and a deep Q-learning network, to generate an adaptive margin policy for training the FR network, which hinders learning efficiency.
Figure 1: (a) Our proposed group adaptive classifier (GAC) automatically chooses between non-adaptive ("N") and adaptive ("A") layers in a multi-layer network, where the latter uses demographic-group-specific kernels and attention. (b) Compared to the baseline with the 50-layer ArcFace backbone, GAC improves face verification accuracy in most groups of the RFW dataset [67], especially under-represented groups, leading to mitigated FR bias.
To mitigate FR bias, our main idea is to optimize face representation learning on every demographic group in a single network, despite demographically imbalanced training data. Conceptually, we may categorize face features into two types of patterns: a general pattern is shared by all faces; differential patterns are relevant to demographic attributes. When the differential pattern of one specific demographic group dominates the training data, the network learns to predict identities mainly based on that pattern, as it is more convenient for minimizing the loss than using other patterns, thus biasing the network towards faces of that specific group.
One mitigation is to give the network more capacity to broaden its scope over multiple face patterns from different demographic groups. An unbiased FR model shall rely not only on unique patterns for the recognition of different groups, but also on general patterns of all faces for improved generalizability. Accordingly, as in Fig. 1, we propose a group adaptive classifier (GAC) to explicitly learn these different feature patterns. GAC includes two modules: the adaptive layer and the automation module. The adaptive layer in GAC comprises adaptive convolution kernels and channel-wise attention maps, where each kernel and attention map tackles faces in one demographic group.
Prior work on dynamic CNNs introduces adaptive convolutions into either every layer [33, 73, 68] or manually specified layers [47, 26, 63]. In contrast, this work proposes an automation module to choose which layers to apply adaptations to. As we observed, not all convolutional layers require adaptive kernels for bias mitigation (see Fig. 4a). At any layer of GAC, only kernels expressing high dissimilarity are considered demographic-adaptive kernels. For those with low dissimilarity, their average kernel is shared by all input images in that layer. Thus, the proposed network progressively learns to select the optimal structure for demographic-adaptive learning, enabling both non-adaptive layers with shared kernels and adaptive layers to be jointly learned in a unified network.
The contributions of the paper are summarized as: 1) a new face recognition algorithm that reduces demographic bias and increases the robustness of representations for faces in every demographic group by adopting adaptive convolutions and attention techniques; 2) a new adaptation mechanism that automatically determines the layers in which to employ dynamic kernels and attention maps; 3) the proposed method achieves SOTA performance on a demographic-balanced dataset and three benchmarks.
Fairness Learning and De-biasing Algorithms.
A variety of fairness techniques have been proposed to prevent machine learning models from utilizing statistical bias in training data, including adversarial training [2, 25, 69, 48], subgroup constraint optimization [34, 81, 70], data pre-processing (e.g., weighted sampling [21] and data transformation [4]), and algorithm post-processing [36, 54]. Another promising approach learns a fair representation to preserve all discerning information about the data
attributes or task-related attributes but eliminate the prejudicial effects of sensitive factors [51, 61, 77, 11, 23]. Locatello et al. [46] show that feature disentanglement is consistently correlated with increasing fairness of general-purpose representations by analyzing a large number of SOTA models. Accordingly, a disentangled representation has been learned to de-bias both FR and demographic attribute estimation [19]. Other studies address the bias issue in FR by leveraging unlabeled faces to improve the performance in groups with fewer samples [55, 67]. Wang et al. [66] propose skewness-aware reinforcement learning to mitigate racial bias in FR. Unlike prior work, our GAC is designed to customize the classifier for each demographic group, which, if successful, would lead to mitigated bias.
Figure 2: A comparison of approaches in adaptive CNNs.
Adaptive Neural Networks.
Three types of CNN-based adaptive learning techniques are related to our work: adaptive architectures, adaptive kernels, and attention mechanisms. Adaptive architectures design new performance-based neural functions or structures, e.g., neuron selection hidden layers [29] and automatic CNN expansion for FR [79]. As CNNs advance many AI fields, prior works propose dynamic kernels to realize content-adaptive convolutions. Li et al. [40] propose a shape-driven kernel for facial trait recognition where each landmark-centered patch has a unique kernel. A convolution fusion for graph neural networks is introduced by [18], where a set of varying-size filters is used per layer. The works of [16] and [41] use a kernel selection scheme to automatically adjust the receptive field size based on inputs. To better suit input data, [15] splits training data into clusters and learns an exclusive kernel per cluster. Li et al. [42] introduce an adaptive CNN for object detection that transfers pre-trained CNNs to a target domain by selecting useful kernels per layer. Alternatively, one may feed input images or features into a kernel function to dynamically generate convolution kernels [62, 76, 39, 32]. Despite its effectiveness, such individual adaptation may not be suitable given the diversity of faces in demographic groups. Our work is most related to the side information adaptive convolution [33], where in each layer a sub-network takes auxiliary information as input to generate filter weights. We mainly differ in that GAC automatically learns where to use adaptive kernels in a multi-layer CNN (see Figs. 2a and 2c), making it more efficient and applicable to deeper CNNs. As the human perception process naturally selects the most pertinent piece of information, attention mechanisms are designed for a variety of tasks, e.g., detection [78], recognition [9], image captioning [8], tracking [7], pose estimation [63], and segmentation [47].
Typically, attention weights are estimated by feeding images or feature maps into a shared network composed of convolutional and pooling layers [3, 9, 43, 60] or a multi-layer perceptron (MLP) [28, 71, 57, 44]. Apart from feature-based attention, Hou et al. [26] propose a correlation-guided cross attention map for few-shot classification, where the correlation between the class feature and the query feature generates the attention weights. The work of [73] introduces a cross-channel communication block to encourage information exchange across channels at the convolutional layer. To accelerate the channel interaction, Wang et al. [68] propose a 1D convolution across channels for attention prediction. Different from prior work, our attention maps are constructed from demographic information (see Figs. 2b and 2c), which improves the robustness of face representations in every demographic group.
Our goal is to train an FR network that is impartial to individuals in different demographic groups. Unlike image-related variations, where face images with large poses or lower resolution are harder to recognize, demographic attributes are subject-related properties with no apparent impact on the recognizability of identity, at least from a layman's perspective. Thus, an unbiased FR system should be able to obtain equally salient features for faces across all demographic groups. However, due to imbalanced demographic distributions and inherent face differences between groups, it has been shown that higher performance is achieved on certain groups even with hand-crafted features [37].
Figure 3:
Overview of the proposed GAC for mitigating FR bias. GAC contains two major modules, i.e., the adaptive layer and the automation module. The adaptive layer consists of adaptive kernels and attention maps. The automation module is employed to decide whether a layer should be adaptive or not.
Hence, it is impractical to extract features from different demographic groups that exhibit equal discriminability. Despite such disparity, an FR algorithm can still be designed to mitigate the difference in performance. To this end, we propose a CNN-based group adaptive classifier that utilizes dynamic kernels and attention maps to boost FR performance in all demographic groups considered here. In particular, GAC has two main modules: an adaptive layer and an automation module. In the adaptive layer, face images or feature maps are convolved with a unique kernel for each demographic group, and multiplied with adaptive attention maps to obtain demographic-differential features for faces in a certain group. The automation module determines in which layers of the network adaptive kernels and attention maps should be applied. Fig. 3 illustrates the overview of GAC. Given an aligned face image and its identity label y_ID, a pre-trained demographic classifier first estimates its demographic attribute y_Demo. With y_Demo, the image is then fed into a recognition network with multiple demographic adaptive layers to estimate the identity of the input. In the following, we present these two modules.
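At a high level, this two-stage pipeline can be sketched as below. This is a minimal illustrative sketch, not the authors' code: the stub layer interface, the `adaptive` flag, and the toy scalar "layers" are our assumptions.

```python
class AdaptiveLayer:
    """Stub for a GAC adaptive layer: it consumes the demographic label to pick
    a group-specific transform (standing in for the kernel mask and attention
    map described in the adaptive-convolution/attention sections)."""
    adaptive = True

    def __init__(self, per_group_scale):
        # Toy stand-in for per-group parameters (one entry per demographic group).
        self.per_group_scale = per_group_scale

    def __call__(self, x, y_demo):
        return x * self.per_group_scale[y_demo]

def gac_forward(image, demographic_classifier, layers):
    """GAC inference: a pre-trained classifier predicts y_Demo once; every
    adaptive layer conditions on it, while non-adaptive layers ignore it."""
    y_demo = demographic_classifier(image)
    x = image
    for layer in layers:
        x = layer(x, y_demo) if getattr(layer, "adaptive", False) else layer(x)
    return x  # the identity representation
```

The same forward pass thus serves all groups, with only the group-conditioned parameters differing.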
Adaptive Convolution.
For a standard convolution operation in a CNN, an image or feature map from the previous layer I_F ∈ R^{ic×ih×iw} is convolved with a single kernel matrix K ∈ R^{kc×ic×kh×kw}, where ic is the number of input channels, kc the number of filters, ih and iw the input size, and kh and kw the filter size. Such an operation shares the kernel with every input that goes through the layer, and is thus agnostic to demographic content, resulting in limited capacity to represent faces of groups with fewer samples. To mitigate the bias in convolution, we introduce a trainable matrix of kernel masks K_M ∈ R^{nd×ic×kh×kw}, where nd is the number of demographic groups. During the forward pass, the demographic label y_Demo and the kernel matrix K_M are fed into the adaptive convolutional layer to generate demographic-adaptive filters. Let K^c ∈ R^{ic×kh×kw} denote the c-th channel filter; the adaptive filter weights for the c-th channel are

K^c_{y_Demo} = K^c ⊙ K^j_M, (1)

where K^j_M ∈ R^{ic×kh×kw} is the j-th kernel mask, i.e., the one for group y_Demo, and ⊙ denotes element-wise multiplication. The c-th channel of the output feature map is then given by O^c_F = f(I_F ∗ K^c_{y_Demo}), where ∗ denotes convolution and f(·) is the activation function. In contrast to conventional convolution, samples in every demographic group have a unique kernel K_{y_Demo}.
Adaptive Attention.
Each channel filter in a CNN plays an important role in every dimension of the final representation, and can be viewed as a semantic pattern detector [8]. In the adaptive convolution, however, the values of a kernel mask are broadcast along the channel dimension, indicating that the weight selection is spatially varied but channel-wise joint. Hence, we introduce a channel-wise attention mechanism to enhance the face features that are demographic-adaptive. First, a trainable matrix of channel attention maps M ∈ R^{nd×kc} is initialized in every adaptive attention layer. Given y_Demo and the current feature map O_F ∈ R^{kc×oh×ow}, where oh and ow are the height and width of O_F, the c-th channel of the new feature map is calculated by

O^c_{y_Demo} = Sigmoid(M_{jc}) · O^c_F, (2)

where M_{jc} is the entry in the j-th row of M, the row for demographic group y_Demo, at the c-th column. In contrast to the adaptive convolution, the elements of each demographic attention map M_j diverge in a channel-wise manner, while the single attention weight M_{jc} is spatially shared by the entire matrix O^c_F ∈ R^{oh×ow}. The two adaptive matrices, K_M and M, are jointly tuned with all the other parameters, supervised by the classification loss.
Unlike dynamic CNNs [33], where additional networks are engaged to produce an input-variant kernel or attention map, our adaptiveness is yielded by a simple thresholding function directly pointing to the demographic group, with no auxiliary networks. Although the kernel network in [33] can generate continuous kernels without enlarging the parameter space, further encoding is required if the side inputs to the kernel network are discrete variables. Our approach, in contrast, divides kernels into clusters so that the branch parameter learning can stick to a specific group without interference from individual uncertainties, making it suitable for discrete domain adaptation. Further, the adaptive kernel masks in GAC are more efficient in terms of the number of additional parameters.
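As a concrete illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch of one adaptive layer's forward pass. The nested-list tensors, function names, and toy shapes are ours, not the authors' implementation:

```python
import math

def adaptive_conv_channel(feature, base_kernel, kernel_masks, group):
    """Eq. (1): element-wise multiply the shared per-channel filter with the
    group's kernel mask, then run a valid-mode convolution (stride 1, no
    padding) for one output channel.
    feature: [ic][ih][iw]; base_kernel and kernel_masks[group]: [ic][kh][kw]."""
    mask = kernel_masks[group]
    ic, ih, iw = len(feature), len(feature[0]), len(feature[0][0])
    kh, kw = len(base_kernel[0]), len(base_kernel[0][0])
    out = [[0.0] * (iw - kw + 1) for _ in range(ih - kh + 1)]
    for y in range(ih - kh + 1):
        for x in range(iw - kw + 1):
            out[y][x] = sum(
                feature[c][y + dy][x + dx] * base_kernel[c][dy][dx] * mask[c][dy][dx]
                for c in range(ic) for dy in range(kh) for dx in range(kw))
    return out

def adaptive_attention_channel(channel_map, attention_maps, group, c):
    """Eq. (2): scale one output channel by the sigmoid of the group-specific
    channel attention weight. attention_maps: [nd][kc]."""
    w = 1.0 / (1.0 + math.exp(-attention_maps[group][c]))
    return [[w * v for v in row] for row in channel_map]
```

Note how the mask varies spatially but is shared across the channel axis, while the attention weight varies per channel but is shared spatially, matching the complementary roles of Eqs. (1) and (2).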
Compared to a non-adaptive layer, the number of additional parameters in GAC is nd × ic × kh × kw, while that of [33] is id × kc × ic × kh × kw if the kernel network is a one-layer MLP, where id is the dimension of the input side information. Thus, for one adaptive layer, [33] has id × kc / nd times more parameters than ours, which can be substantial given the typically large value of kc, the number of filters.
Though faces in different demographic groups are adaptively processed by various kernels and attention maps, it is inefficient to use such adaptations in every layer of a deep CNN. To relieve the burden of unnecessary parameters and avoid empirical trimming, we adopt a similarity fusion process to automatically determine the adaptive layers. Since the same fusion scheme can be used for both types of adaptation, we take the adaptive convolution as an example to illustrate this automatic scheme. First, a matrix composed of nd kernel masks is initialized in every convolutional layer. As training continues, each kernel mask is updated independently to reduce the face classification loss for each demographic group. Second, we reshape the kernel masks into 1D vectors V = [v_1, v_2, ..., v_nd], where v_i ∈ R^l, with l = ic × kh × kw, represents the kernel mask of the i-th demographic group. Next, we compute the Cosine similarity between two kernel vectors, θ_ij = (v_i / ‖v_i‖) · (v_j / ‖v_j‖), where i, j ∈ {1, 2, ..., nd}. The average similarity of all pair-wise Cosine values is obtained by θ̄ = 2 / (nd(nd − 1)) Σ_{i<j} θ_ij. If θ̄ is higher than a pre-defined threshold τ, the kernel parameters in this layer reveal a demographic-agnostic property. Hence, we merge the nd kernels into a single kernel by taking the average along the group dimension. In the subsequent training, this single kernel can still be updated separately for each demographic group, since the kernels may become demographic-adaptive in later epochs.
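This similarity-fusion step can be sketched as follows; the function names and the exact merge policy are our illustrative assumptions over the description above:

```python
import math

def mean_pairwise_cosine(masks):
    """theta_bar: (2 / (nd*(nd-1))) * sum_{i<j} cos(v_i, v_j), where each kernel
    mask [ic][kh][kw] is flattened to a 1D vector before comparison."""
    vecs = [[x for ch in m for row in ch for x in row] for m in masks]
    norms = [math.sqrt(sum(x * x for x in v)) for v in vecs]
    nd = len(vecs)
    total = sum(sum(a * b for a, b in zip(vecs[i], vecs[j])) / (norms[i] * norms[j])
                for i in range(nd) for j in range(i + 1, nd))
    return 2.0 * total / (nd * (nd - 1))

def fuse_if_agnostic(masks, tau):
    """If theta_bar exceeds tau, the layer looks demographic-agnostic: replace
    the nd masks with their element-wise average (they may diverge again in
    later epochs); otherwise keep them adaptive."""
    if mean_pairwise_cosine(masks) <= tau:
        return masks
    nd = len(masks)
    avg = [[[sum(m[c][h][w] for m in masks) / nd
             for w in range(len(masks[0][c][h]))]
            for h in range(len(masks[0][c]))]
           for c in range(len(masks[0]))]
    return [avg for _ in range(nd)]
```

In training, this check would be run per layer so that the set of adaptive layers is selected automatically rather than hand-picked.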
We monitor the similarity trend of the adaptive kernels in each layer until θ̄ is stable.
Datasets: Our bias study uses the RFW dataset [67] for testing and the BUPT-Balancedface dataset [66] for training. RFW consists of faces in four race/ethnic groups: White, Black, East Asian, and South Asian. Each group contains ∼10K images of 3K individuals for face verification. BUPT-Balancedface contains 1.3M images of 28K celebrities and is approximately race-balanced with 7K identities per race. Other than race, we also consider gender bias in face representation learning. We combine IMDB [56], UTKFace [80], AgeDB [50], AAF [10], and AFAD [52] to train a gender classifier, which is used to estimate the gender of faces in RFW and BUPT-Balancedface. All face images are cropped and resized to 112 × 112 pixels via landmarks detected by RetinaFace [13].
Implementation Details: We train a baseline network and GAC on BUPT-Balancedface, using the 50-layer ArcFace architecture [12]. The classification loss is the additive Cosine margin of CosFace [65], with scale s = 64 and margin m = 0.35. Training is optimized by SGD with momentum and weight decay; the learning rate follows a step-decay schedule, with different epoch milestones for the baseline and for GAC. τ = 0 is chosen for automatic adaptation in GAC. Our FR models are trained to extract a 512-dim representation. Our demographic classifier uses a ResNet [24]. Comparing GAC and the baseline, the average feature extraction speed per image on an Nvidia Ti GPU and the number of model parameters are similar for the two models.
RFW [67] uses Caucasian, African, Asian, and Indian to name its demographic groups. We adopt these groups and accordingly rename them to White, Black, East Asian, and South Asian for a clearer race/ethnicity definition.

Method            White   Black   East Asian  South Asian  Avg     STD
SOTA
  RL-RBN [66]     xx.27   95.00   94.82       94.68        95.19   0.xx
  ACNN [33]       xx.12   94.00   93.67       94.55        94.58   0.xx
  PFE [59]        xx.xx   xx.xx   xx.27       94.60        95.11   0.xx
Ablation
  Baseline        xx.18   93.98   93.72       94.67        94.64   1.xx
  GAC-Channel     xx.95   93.67   94.33       94.78        94.68   0.xx
  GAC-Kernel      xx.23   94.40   94.27       94.80        94.93   0.xx
  GAC-Spatial     xx.97   93.20   93.67       93.93        94.19   1.xx
  GAC-CS          xx.22   93.95   94.32       95.12        94.65   0.xx
  GAC-CSK         xx.18   93.58   94.28       94.83        94.72   0.xx
  GAC-(τ = −0.1)  xx.xx   xx.35   94.63       94.77        95.07   0.xx
  GAC-(τ = 0.1)   xx.25   93.95   93.82       94.77        94.70   0.xx
  GAC             xx.23   94.xx   xx.93       95.12        95.23   0.xx
Table 1: Verification Accuracy (%) on the protocol of RFW [67].
Method     Gender  White         Black         East Asian    South Asian   Avg           STD
Baseline   Male    xx.xx ± 0.08  96.xx ± 0.26  97.xx ± 0.09  97.xx ± 0.13  96.xx ± 0.03  0.xx ± x.xx
           Female  xx.xx ± 0.10  97.xx ± 0.11  95.xx ± 0.11  96.xx ± x.xx
AL+Manual  Male    xx.xx ± 0.10  98.xx ± 0.17  98.xx ± x.xx  xx.xx ± x.xx  xx.xx ± 0.05  0.xx ± x.xx
           Female  xx.xx ± x.xx  xx.xx ± x.xx  xx.xx ± 0.19  97.xx ± x.xx
GAC        Male    xx.xx ± 0.04  98.xx ± 0.20  98.xx ± x.xx  xx.xx ± x.xx  xx.xx ± 0.06  0.xx ± x.xx
           Female  xx.xx ± x.xx  xx.xx ± x.xx  xx.xx ± 0.12  97.xx ± x.xx
Verification Accuracy (%) of -fold cross-validation on groups of RFW [67]. We first follow RFW face verification protocol with K pairs per race/ethnicity. The models aretrained on BUPT-Balancedface with ground truth race/ethnicity and identity labels. The commongroup fairness criteria like demographic parity distance are improper to evaluate fairness of learntrepresentations, since they are typically designed to measure independence properties of randomvariables. However, in FR the sensitive demographic characteristics are tied to identities, making thesetwo variables correlated. The NIST report proposes to use false negative and false positive for eachdemographic group to measure the fairness [20]. Instead of plotting false negative vs. false positives,we use a compact quantitative measure, i.e. , the standard deviation (STD) of the performance indifferent demographic groups, that was previously introduced in [66, 19] and called “biasness”. Wealso report average accuracy (Avg) to show the overall FR performance.
Ablation
Deep feature maps contain both spatial and channel-wise information. Here we investigate the relationship among adaptive kernels, spatial and channel-wise attention, and their impact on bias mitigation. We also study the impact of τ in our automation module. Apart from the baseline and GAC, we ablate seven variants: (1) GAC-Channel: channel-wise attention for race-differential features; (2) GAC-Kernel: adaptive convolution with race-specific kernels; (3) GAC-Spatial: only spatial attention added to the baseline; (4) GAC-CS: both channel-wise and spatial attention; (5) GAC-CSK: adaptive convolution combined with spatial and channel-wise attention; (6, 7) GAC-(τ = ∗): τ set to ∗.
Since the approach of ACNN [33] is related to GAC, we re-implement it and apply it to the bias mitigation problem. First, we train a race classifier with the cross-entropy loss on BUPT-Balancedface. Then the softmax output of the race classifier is fed to a filter manifold network (FMN) to generate adaptive filter weights. Here, FMN is a two-layer MLP with a ReLU in between. Similar to GAC, race probabilities are considered auxiliary information for face representation learning. We also compare with the SOTA approach PFE [59] by training a PFE model on BUPT-Balancedface.
Tab. 1 reports the results of SOTA algorithms and ablation variants on the RFW protocol. We make several observations: (1) the baseline model is the most biased across race groups. (2) Spatial attention mitigates the race bias at the cost of verification accuracy, and is less effective at learning fair features than the other adaptive techniques. This is probably because spatial contents, especially local layout information, only reside at earlier CNN layers, where the spatial dimensions are gradually reduced by the subsequent convolutions and poolings. Therefore, semantic details like demographic attributes are hardly encoded spatially.
(3) Compared to GAC, combining adaptive kernels with both spatial and channel-wise attention increases the number of parameters, lowering the performance. (4) As τ determines the number of adaptive layers in GAC, it has a great impact on performance: a small τ may introduce redundant adaptive layers, while the adaptive layers may lack capacity if τ is too large. (5) GAC is superior to SOTA w.r.t. average performance and feature fairness. Compared to the kernel masks in GAC, the FMN in ACNN [33] contains more trainable parameters, and applying it to each convolutional layer is prone to overfitting; in fact, eight layers are empirically chosen for the FMN-based convolution. (6) Even though PFE performs the best on standard benchmarks (Tab. 3),
Figure 4: (a) For each of the three τ values (−0.1, 0, 0.1) in automatic adaptation, we show the average negative Cosine values of the pair-wise demographic kernel masks, i.e., θ̄, across layers (y-axis) and training steps (x-axis). The number of adaptive layers, i.e., layers with θ̄ > τ at the final step, differs in the three cases. (b) With two race groups (White and Black in PCSO [37]) and two models (baseline and GAC), for each of the four combinations we compute the pair-wise correlation of face representations using any two subjects of the same race, and plot the histogram of correlations. GAC reduces the difference/bias between the two distributions.
Method          LFW (%)  |  Method             IJB-A (%)    IJB-C TAR @ FAR (%)
                         |                     @ 0.1% FAR   0.001%  0.01%  0.1%
DeepFace+ [64]  xx.xx    |  Yin et al. [75]    xx.x ± x.x   -       -      69.3
CosFace [65]    xx.xx    |  Cao et al. [5]     xx.x ± x.x   xx.x    xx.x   xx.x
ArcFace [12]    xx.xx    |  Multicolumn [72]   xx.x ± x.x   xx.x    xx.x   xx.x
PFE [59]        xx.xx    |  PFE [59]           xx.x ± x.x   xx.x    xx.x   xx.x
Baseline        xx.xx    |  Baseline           xx.x ± x.x   xx.x    xx.x   xx.x
GAC             xx.xx    |  GAC                xx.x ± x.x   xx.x    xx.x   xx.x
Table 3: Verification performance on LFW, IJB-A, and IJB-C. [Key: Best, Second, Third Best]
it still exhibits high biasness. Our GAC outperforms PFE on RFW in both biasness and average performance. As the race data is a four-element input in our case, using extra kernel networks adds complexity to the FR network, which degrades the verification performance. Fig. 6 shows pairs of false positives (two faces falsely verified as the same identity) and false negatives (two faces falsely verified as different identities) produced by the baseline but successfully verified by GAC.
We now extend demographic attributes to both gender and race. First, we train two classifiers that predict the gender and race/ethnicity of a face image; their classification accuracies are relatively low. Then, these fixed classifiers are affiliated with GAC to provide demographic information for learning adaptive kernels and attention maps. We merge BUPT-Balancedface and RFW, and split the subjects into five sets for each of the eight demographic groups. In 5-fold cross-validation, each time a model is trained on four sets and tested on the remaining set.
Here we demonstrate the efficacy of the automation module for training GAC. We compare to a manually designed scheme (AL+Manual) that adds adaptive kernels and attention maps to a fixed subset of layers. Specifically, the first block in every residual unit is chosen to be an adaptive convolution layer, and channel-wise attention is applied to the feature map output by the last block in every residual unit; the number of adaptive convolutional layers and groups of channel-wise attention maps is thus fixed by the architecture. As shown in Tab. 2, automatic adaptation is more effective at enhancing the discriminability and fairness of face representations. Figure 4a shows how the dissimilarity of kernel masks in the convolutional layers changes during training under three threshold values τ. A lower τ results in more adaptive layers. We see that the layers determined to be adaptive vary across both layers (vertically) and training time (horizontally), which shows the importance of our automatic mechanism.
This seemingly low accuracy is mainly due to the large dataset we assembled for training and testing gender/race classification. Our demographic classifier has been shown to perform comparably to SOTA on common benchmarks.
While demographic estimation errors impact the training, testing, and evaluation of bias mitigation algorithms, the evaluation is of the most concern, as errors in demographic labels may greatly impact the biasness calculation. Thus, future development may include either manually cleaning the labels, or designing a biasness metric robust to demographic label errors.

Race          Mean (Baseline / GAC)   STD (Baseline / GAC)   Rel. Entropy (Baseline / GAC)
White         1.15 / 1.17             0.30 / 0.31            0.xx / x.xx
Black         1.07 / 1.10             0.27 / 0.28            0.61 / 0.xx
East Asian    1.08 / 1.10             0.31 / 0.32            0.65 / 0.xx
South Asian   1.15 / 1.18             0.31 / 0.32            0.19 / 0.xx
Table 4:
Distribution of ratios between minimum inter-class distance and maximum intra-class distance of face features in race groups of RFW. GAC exhibits higher ratios and more similar distributions to the reference.
While our GAC mitigates bias, we also hope it can perform well on standard benchmarks. Therefore, we also evaluate GAC on standard benchmarks without considering demographic impacts, including LFW [31], IJB-A [38], and IJB-C [49]. These datasets exhibit imbalanced distributions in demographics. For a fair comparison with SOTA, instead of using ground-truth demographics, we train GAC on MS-Celeb-1M [22] with the demographic attributes estimated by the classifier pre-trained in Sec. 4.2. As shown in Tab. 3, GAC outperforms the baseline and achieves comparable performance to SOTA.
Figure 5: The first row shows the average faces of groups in RFW. The next two rows show gradient-weighted class activation heatmaps [58] at a convolutional layer of GAC (second row) and the baseline (third row).
Figure 6: False positive and false negative pairs on RFW given by the baseline but successfully verified by GAC.
To understand the adaptive kernels in GAC, we visualize the feature maps at an adaptive layer for faces of various demographics, via a PyTorch visualization tool [53]. We visualize important regions of faces pertaining to the FR decision by using gradient-weighted class activation mapping (Grad-CAM) [58]. Grad-CAM uses the gradients propagated back from the final layer corresponding to an input identity, and guides the target feature map to highlight important regions for identity prediction. Figure 5 shows that, compared to the baseline model, the salient regions of GAC demonstrate more diversity on faces from different groups. This illustrates the variability of parameters in GAC for each group.
Bias via local geometry: In addition to STD, we also explain the bias phenomenon via the local geometry of the face representations in each demographic group. We assume that the statistics of the neighbors of a given point (representation) reflect certain properties of its manifold (local geometry). Accordingly, we first illustrate the pair-wise correlation of face representations. To minimize variations caused by other latent variables, we use constrained frontal faces of a mugshot dataset, PCSO [37], to show the demographic impact on the divergence of face features. We randomly select K White and K Black subjects from PCSO and compute their pair-wise correlations within each race. In Fig. 4b, we observe that Base-White representations have lower inter-class correlation than Base-Black, i.e., faces in the White group are better represented by the baseline than those in the Black group. In contrast, GAC-White and GAC-Black show more similar correlation histograms.
Since PCSO has few Asian subjects, we use RFW to design a second way to examine the local geometry of the four race groups. Specifically, after normalizing the representations, we compute the pair-wise Euclidean distances and measure the ratio between the minimum distance of inter-subject pairs and the maximum distance of intra-subject pairs. We compute the mean and standard deviation (STD) of the ratio distributions of the four groups, for both models. We also gauge the relative entropy to measure how much the distributions deviate from each other; for simplicity, we choose the White group as the reference distribution. Tab. 4 shows that, while GAC yields only a minor improvement over the baseline in the mean, it gives smaller relative entropy in the other three groups, indicating that the ratio distributions of the other races under GAC are more similar, i.e., less biased, relative to the reference distribution. These results demonstrate the capability of GAC to increase the fairness of face representations.
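The local-geometry measure above can be sketched as follows: for each subject, take the ratio between its minimum inter-subject distance and its maximum intra-subject distance on unit-normalized features, then compare the ratio histograms of two groups by relative entropy (KL divergence). The function names and the histogram binning are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def ratio_distribution(feats, labels):
    """Per-subject ratio of min inter-class distance to max intra-class distance.

    feats: (N, D) face features; labels: (N,) subject ids.
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)       # unit-normalize
    dists = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)    # pair-wise Euclidean
    ratios = []
    for s in np.unique(labels):
        intra = dists[np.ix_(labels == s, labels == s)]
        inter = dists[np.ix_(labels == s, labels != s)]
        if intra.shape[0] > 1 and inter.size > 0:
            ratios.append(inter.min() / max(intra.max(), 1e-8))
    return np.array(ratios)

def relative_entropy(p_samples, q_samples, bins=20, lo=0.0, hi=3.0):
    """KL(P || Q) between histogram estimates of two ratio distributions."""
    eps = 1e-8
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p.astype(float) + eps; p /= p.sum()    # smooth and normalize
    q = q.astype(float) + eps; q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

A ratio above 1 means the subject's nearest impostor is farther than its farthest genuine match, i.e., the identity is well separated; comparing each group's ratio histogram to the reference group's via `relative_entropy` quantifies the deviation reported in Tab. 4.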
This paper tackles the issue of demographic bias in face recognition by learning a fair face representation. A group adaptive classifier (GAC) is proposed to improve the robustness of representations for every demographic group considered here. Both adaptive convolution kernels and channel-wise attention maps are introduced in GAC. We further add an automatic adaptation module that determines whether to use adaptations in a given layer. Our findings suggest that faces can be better represented by using layers adaptive to different demographic groups, leading to more balanced performance gains for all groups. As GAC is agnostic to the network architecture, one of our future directions is to apply GAC to various backbone networks, both to validate it and to further improve face recognition performance.
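The adaptive mechanism summarized above can be sketched as a layer that modulates a shared convolution with one trainable kernel mask and one channel-wise attention vector per demographic group. This is an illustrative reconstruction from the paper's description, not the authors' code; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupAdaptiveConv(nn.Module):
    """Shared conv whose kernels are masked and whose output channels are
    re-weighted per demographic group (illustrative sketch of a GAC-style layer)."""

    def __init__(self, in_ch, out_ch, n_groups, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # one trainable kernel mask per group, squashed by a sigmoid
        self.masks = nn.Parameter(torch.zeros(n_groups, out_ch, in_ch, k, k))
        # one channel-wise attention vector per group
        self.attn = nn.Parameter(torch.zeros(n_groups, out_ch))

    def forward(self, x, group_ids):
        """x: (B, in_ch, H, W); group_ids: (B,) demographic labels."""
        outs = []
        for g in group_ids.unique():
            idx = (group_ids == g).nonzero(as_tuple=True)[0]
            w = self.weight * torch.sigmoid(self.masks[g])          # group-specific kernels
            y = F.conv2d(x[idx], w, padding=1)
            y = y * torch.sigmoid(self.attn[g]).view(1, -1, 1, 1)   # channel attention
            outs.append((idx, y))
        out = torch.zeros(x.size(0), *outs[0][1].shape[1:])
        for idx, y in outs:
            out[idx] = y                                            # restore batch order
        return out
```

In the paper, such adaptive layers also pass through the automated adaptation check: a layer keeps its per-group parameters only if they remain sufficiently dissimilar across groups; otherwise it falls back to a shared layer.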
Broader Impact
As face recognition (FR) systems are deployed in the real world for societal benefit, it is desirable to develop approaches that are unbiased towards different demographic groups. De-biasing an FR algorithm while maintaining its average performance can be challenging due to the lack of discriminability in under-represented groups. Our approach addresses this problem via a group adaptive classifier mechanism that leverages both attention and adaptive learning strategies, and it can be extended to other group-fairness learning tasks as well.
References

[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In ECML. Springer, 2004.
[2] Mohsan Alvi, Andrew Zisserman, and Christoffer Nellåker. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In ECCV, 2018.
[3] Alexei A Bastidas and Hanlin Tang. Channel attention networks. In CVPR Workshops, 2019.
[4] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In NeurIPS, 2017.
[5] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In FG. IEEE, 2018.
[6] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002.
[7] Boyu Chen, Peixia Li, Chong Sun, Dong Wang, Gang Yang, and Huchuan Lu. Multi attention module for visual tracking. Pattern Recognition, 2019.
[8] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.
[9] Zetao Chen, Lingqiao Liu, Inkyu Sa, Zongyuan Ge, and Margarita Chli. Learning context flexible attention model for long-term visual place recognition. IEEE Robotics and Automation Letters, 2018.
[10] Jingchun Cheng, Yali Li, Jilong Wang, Le Yu, and Shengjin Wang. Exploiting effective facial patches for robust gender recognition. Tsinghua Science and Technology, 24(3):333–345, 2019.
[11] Elliot Creager, David Madras, Joern-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In ICML, 2019.
[12] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[13] Jiankang Deng, Jia Guo, Yuxiang Zhou, Jinke Yu, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-stage dense face localisation in the wild. arXiv preprint, 2019.
[14] Thomas G Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.
[15] Chen Ding, Ying Li, Yong Xia, Wei Wei, Lei Zhang, and Yanning Zhang. Convolutional neural networks based hyperspectral image classification method with adaptive kernels. Remote Sensing, 2017.
[16] Chen Ding, Ying Li, Yong Xia, Lei Zhang, and Yanning Zhang. Automatic kernel size determination for deep neural networks based hyperspectral image classification. Remote Sensing, 2018.
[17] Qi Dong, Shaogang Gong, and Xiatian Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[18] Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
[19] Sixue Gong, Xiaoming Liu, and Anil K Jain. DebFace: De-biasing face recognition. arXiv preprint arXiv:1911.08080, 2019.
[20] Patrick Grother, Mei Ngan, and Kayee Hanaoka. Face recognition vendor test (FRVT) part 3: Demographic effects. Technical Report, National Institute of Standards and Technology, 2019.
[21] Aditya Grover, Jiaming Song, Ashish Kapoor, Kenneth Tran, Alekh Agarwal, Eric J Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. In NeurIPS, 2019.
[22] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[23] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In NeurIPS, 2016.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
[26] Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Cross attention network for few-shot classification. In NeurIPS, 2019.
[27] J Howard, Y Sirotin, and A Vemury. The effect of broad and specific demographic homogeneity on the imposter distributions and false match rates in face recognition algorithm performance. In IEEE BTAS, 2019.
[28] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[29] Ting-Kuei Hu, Yen-Yu Lin, and Pi-Cheng Hsiu. Learning adaptive hidden layers for mobile gesture recognition. In AAAI, 2018.
[30] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Deep imbalanced learning for face recognition and attribute prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[31] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[32] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In NeurIPS, 2016.
[33] Di Kang, Debarun Dhar, and Antoni Chan. Incorporating side information by adaptive convolution. In NeurIPS, 2017.
[34] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019.
[35] Salman Khan, Munawar Hayat, Syed Waqas Zamir, Jianbing Shen, and Ling Shao. Striking the right balance with uncertainty. In CVPR, 2019.
[36] Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In AAAI/ACM, 2019.
[37] Brendan F Klare, Mark J Burge, Joshua C Klontz, Richard W Vorder Bruegge, and Anil K Jain. Face recognition performance: Role of demographic information. IEEE Trans. Information Forensics and Security, 7(6):1789–1801, 2012.
[38] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, 2015.
[39] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In CVPR, 2015.
[40] Shaoxin Li, Junliang Xing, Zhiheng Niu, Shiguang Shan, and Shuicheng Yan. Shape driven kernel adaptation in convolutional neural network for robust facial traits recognition. In CVPR, 2015.
[41] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In CVPR, 2019.
[42] Xudong Li, Mao Ye, Yiguang Liu, and Ce Zhu. Adaptive deep convolutional neural networks for scene-specific object detection. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[43] Hefei Ling, Jiyang Wu, Junrui Huang, Jiazhong Chen, and Ping Li. Attention-based convolutional neural network for deep face recognition. Multimedia Tools and Applications, 2020.
[44] Drew Linsley, D Schiebler, Sven Eberhardt, and Thomas Serre. Learning what and where to attend. In ICLR, 2019.
[45] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[46] Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. On the fairness of disentangled representations. In NeurIPS, 2019.
[47] Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR, 2019.
[48] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In ICML, 2018.
[49] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In ICB, 2018.
[50] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. AgeDB: The first manually collected, in-the-wild age database. In CVPRW, 2017.
[51] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. In NeurIPS, 2018.
[52] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output CNN for age estimation. In CVPR, 2016.
[53] Utku Ozbulak. Pytorch cnn visualizations. https://github.com/utkuozbulak/pytorch-cnn-visualizations, 2019.
[54] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In NeurIPS, 2017.
[55] Haoyu Qin. Asymmetric rejection loss for fairer face recognition. arXiv preprint arXiv:2002.03276, 2020.
[56] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. IJCV, 2018.
[57] Muhammad Sadiq, Daming Shi, Meiqin Guo, and Xiaochun Cheng. Facial landmark detection via attention-adaptive deep network. IEEE Access, 2019.
[58] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[59] Yichun Shi and Anil K Jain. Probabilistic face embeddings. In ICCV, 2019.
[60] Vishwanath A Sindagi and Vishal M Patel. HA-CCN: Hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing, 2019.
[61] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. Learning controllable fair representations. In ICAIS, 2019.
[62] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In CVPR, 2019.
[63] Kai Su, Dongdong Yu, Zhenqi Xu, Xin Geng, and Changhu Wang. Multi-person pose estimation with enhanced channel-wise and spatial information. In CVPR, 2019.
[64] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[65] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[66] Mei Wang and Weihong Deng. Mitigate bias in face recognition using skewness-aware reinforcement learning. arXiv preprint arXiv:1911.10692, 2019.
[67] Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In ICCV, 2019.
[68] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. arXiv preprint arXiv:1910.03151, 2019.
[69] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In CVPR, 2019.
[70] Zeyu Wang, Klint Qinami, Yannis Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. arXiv preprint arXiv:1911.11834, 2019.
[71] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
[72] Weidi Xie and Andrew Zisserman. Multicolumn networks for face recognition. arXiv preprint arXiv:1807.09192, 2018.
[73] Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, and Devi Parikh. Cross-channel communication networks. In NeurIPS, 2019.
[74] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[75] Xi Yin and Xiaoming Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Trans. Image Processing, 27(2):964–975, 2017.
[76] Julio Zamora Esquivel, Adan Cruz Vargas, Paulo Lopez Meyer, and Omesh Tickoo. Adaptive convolutional kernels. In ICCV Workshops, 2019.
[77] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.
[78] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. Progressive attention guided recurrent network for salient object detection. In CVPR, 2018.
[79] Yuanyuan Zhang, Dong Zhao, Jiande Sun, Guofeng Zou, and Wentao Li. Adaptive convolutional neural network and its application in face recognition. Neural Processing Letters, 2016.
[80] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
[81] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017.

Mitigating Face Recognition Bias via Group Adaptive Classifier (Supplementary Material)
Sixue Gong Xiaoming Liu Anil K. Jain
Michigan State University, East Lansing MI 48824 {gongsixu, liuxm, jain}@msu.edu
In this supplementary material we include: (1) Section 1: the statistics of the datasets used in the experimental section; (2) Section 2: the performance of the pre-trained gender and race/ethnicity classifiers that provide GAC with demographic information; (3) Section 3: a bias comparison between GAC and its ablation variants.
Tab. 1 summarizes the datasets we adopt for the experiments and the types of demographic annotations they provide. In Tab. 2, we report the statistics of each data fold for the cross-validation experiment on the BUPT-Balancedface and RFW datasets.
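One way to build such cross-validation folds is to assign subjects (not images) to folds, stratified by race so that every fold has a similar racial composition. The sketch below is illustrative; the paper's exact per-group split sizes are given in its Tab. 2 and are not reproduced here.

```python
import numpy as np

def make_folds(subject_ids, races, n_folds=5, seed=0):
    """Assign each subject to one fold, stratified by race group, so that
    folds are subject-disjoint and racially balanced."""
    rng = np.random.default_rng(seed)
    fold_of = {}
    for race in np.unique(races):
        subs = rng.permutation(np.unique(subject_ids[races == race]))
        for i, s in enumerate(subs):
            fold_of[s] = i % n_folds        # round-robin over shuffled subjects
    # map the per-subject assignment back to every image
    return np.array([fold_of[s] for s in subject_ids])
```

Training then uses the images whose fold index differs from the held-out fold, rotating the held-out fold across all five splits.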
Dataset                | Demographic annotations
IMDB [10]              | Gender, Age
UTKFace [14]           | Gender, Age, Race/Ethnicity
AgeDB [8]              | Gender, Age
AFAD [9]               | Gender, Age, Ethnicity (East Asian)
AAF [1]                | Gender, Age
RFW [13]               | Race/Ethnicity
BUPT-Balancedface [12] | Race/Ethnicity
IMFDB-CVIT [11]        | Gender, Age Groups, Ethnicity (South Asian)
MS-Celeb-1M [4]        | No demographic labels
PCSO [2]               | Gender, Age, Race/Ethnicity
LFW [5]                | No demographic labels
IJB-A [6]              | Gender, Age, Skin Tone
IJB-C [7]              | Gender, Age, Skin Tone

Table 1: Statistics of training and testing datasets for the experiments in the paper.
Table 2: Statistics of the dataset folds in the cross-validation experiment.
Figure 1: Statistics of the datasets for training and testing the demographic attribute estimation networks. (a) The number of images in each gender group of the datasets for gender estimation; (b) the number of images in each race/ethnicity group of the datasets for race/ethnicity estimation.

Figure 2: Performance of the demographic attribute estimation networks. (a) The classification accuracy in each gender group; (b) the classification accuracy in each race/ethnicity group. The red dashed line shows the average performance.

We train a gender classifier and a race/ethnicity classifier to provide GAC with demographic information during both training and testing. We use the same datasets as [3] for training and evaluating the two demographic attribute classifiers: the combination of IMDB, UTKFace, AgeDB, AFAD, and AAF for gender estimation, and the collection of AFAD, RFW, IMFDB-CVIT, and PCSO for race/ethnicity estimation. Fig. 1 shows the total number of images in each demographic group of the training and testing sets. Fig. 2 shows the performance of demographic attribute estimation on the testing set. For gender estimation, the performance on the male group is better than that on the female group. For race/ethnicity estimation, the White group outperforms the other race/ethnicity groups.
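The per-group accuracies plotted in Fig. 2 amount to a few lines of bookkeeping; the helper below is an illustrative sketch with hypothetical array names, not the paper's evaluation code.

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Classification accuracy within each demographic group, plus the mean
    across groups (the red dashed line in Fig. 2)."""
    accs = {}
    for g in np.unique(groups):
        m = groups == g
        accs[g] = float(np.mean(y_true[m] == y_pred[m]))
    return accs, float(np.mean(list(accs.values())))
```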
We extend Tab. 4 of the main paper and compare the proposed GAC with its other ablation variants. Tab. 3 reports the distribution parameters of the features extracted by the different networks. Comparing the relative entropy (RE), we notice that GAC gives the smallest values in the three non-reference race/ethnicity groups among all ablation methods, which shows the efficacy of GAC in mitigating demographic bias.

Method        | White (Mean/STD/RE) | Black (Mean/STD/RE) | East Asian (Mean/STD/RE) | South Asian (Mean/STD/RE)
Baseline      | 1.15 / 0.30 / 0.00  | 1.07 / 0.27 / 0.61  | 1.08 / 0.31 / 0.65       | 1.15 / 0.31 / 0.19
GAC-Channel   | 1.17 / 0.30 / 0.00  | 1.11 / 0.28 / 0.43  | 1.10 / 0.32 / 0.63       | 1.17 / 0.31 / –
GAC-Kernel    | 1.18 / 0.29 / 0.00  | 1.09 / 0.28 / 0.42  | 1.10 / 0.31 / 0.59       | 1.17 / 0.31 / –
GAC-Spatial   | 1.14 / 0.32 / 0.00  | 1.10 / 0.29 / 0.60  | 1.10 / 0.30 / 0.65       | 1.16 / 0.30 / –
GAC-CS        | 1.16 / 0.31 / 0.00  | 1.09 / 0.28 / 0.46  | 1.09 / 0.32 / 0.62       | 1.17 / 0.31 / –
GAC-CSK       | 1.17 / 0.31 / 0.00  | 1.11 / 0.28 / 0.51  | 1.10 / 0.32 / 0.63       | 1.18 / 0.31 / –
GAC-(τ = −.)  | 1.17 / 0.31 / 0.00  | 1.11 / 0.28 / 0.43  | 1.10 / 0.32 / 0.61       | 1.17 / 0.30 / –
GAC-(τ = 0.)  | 1.16 / 0.31 / 0.00  | 1.10 / 0.27 / 0.45  | 1.10 / 0.32 / 0.62       | 1.18 / 0.32 / –
GAC           | 1.17 / 0.31 / 0.00  | 1.10 / 0.28 / 0.43  | 1.10 / 0.32 / 0.58       | 1.18 / 0.32 / –

Table 3: Distribution of ratios between the minimum inter-class distance and the maximum intra-class distance of face features in the race groups of RFW. GAC exhibits higher ratios and distributions more similar to the reference.

References

[1] Jingchun Cheng, Yali Li, Jilong Wang, Le Yu, and Shengjin Wang. Exploiting effective facial patches for robust gender recognition. Tsinghua Science and Technology, 24(3):333–345, 2019.
[2] Debayan Deb, Lacey Best-Rowden, and Anil K Jain. Face recognition performance under aging. In CVPRW, 2017.
[3] Sixue Gong, Xiaoming Liu, and Anil K Jain. DebFace: De-biasing face recognition. arXiv preprint arXiv:1911.08080, 2019.
[4] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[5] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[6] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, 2015.
[7] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In ICB, 2018.
[8] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. AgeDB: The first manually collected, in-the-wild age database. In CVPRW, 2017.
[9] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Ordinal regression with multiple output CNN for age estimation. In CVPR, 2016.
[10] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. IJCV, 2018.
[11] Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radhesyam Vaddi, Vidyagouri Hemadri, J C Karure, Raja Raju, Rajan, Vijay Kumar, and C V Jawahar. Indian Movie Face Database: A benchmark for face recognition under wide variations. In NCVPRIPG, 2013.
[12] Mei Wang and Weihong Deng. Mitigate bias in face recognition using skewness-aware reinforcement learning. arXiv preprint arXiv:1911.10692, 2019.
[13] Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In ICCV, 2019.
[14] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.