Fairness with Overlapping Groups
A Preprint
Forest Yang* (UC Berkeley)
Moustapha Cisse (Google Research Accra)
Sanmi Koyejo (Google Research Accra & University of Illinois)

Abstract
In algorithmically fair prediction problems, a standard goal is to ensure the equality of fairness metrics across multiple overlapping groups simultaneously. We reconsider this standard fair classification problem using a probabilistic population analysis, which, in turn, reveals the Bayes-optimal classifier. Our approach unifies a variety of existing group-fair classification methods and enables extensions to a wide range of non-decomposable multiclass performance metrics and fairness measures. The Bayes-optimal classifier further inspires consistent procedures for algorithmically fair classification with overlapping groups. On a variety of real datasets, the proposed approach outperforms baselines in terms of its fairness-performance tradeoff.
Introduction

Machine learning models inform an increasingly large number of critical decisions in diverse settings. They assist medical diagnosis (McKinney et al., 2020), guide policing (Meijer and Wessels, 2019), and power credit scoring systems (Tsai and Wu, 2008). While they have demonstrated their value in many sectors, they are prone to unwanted biases, leading to discrimination against protected subgroups within the population. For example, recent studies have revealed biases in predictive policing and criminal sentencing systems (Meijer and Wessels, 2019; Chouldechova, 2017). The blossoming body of research in algorithmic fairness aims to study and address this issue by introducing novel algorithms guaranteeing a certain level of non-discrimination in predictions. Each such algorithm relies on a specific definition of fairness, which falls into one of two categories: individual fairness (Dwork et al., 2012; Zemel et al., 2013) or group fairness (Calders and Verwer, 2010; Kamishima et al., 2011; Hardt et al., 2016a). The vast majority of the algorithmic group fairness literature has focused on the simplest case where there are only two groups. In this paper, we consider the more nuanced case of group fairness with respect to multiple groups.

The simplest setting is the independent case, with only one sensitive attribute which can take multiple values, e.g., race only. The presence of multiple sensitive attributes (e.g., race and gender simultaneously) leads to non-equivalent definitions of group fairness. On the one hand, fairness can be considered independently per sensitive attribute, leading to overlapping subgroups. For example, consider a model restricted to demographic parity between subgroups defined by ethnicity. Simultaneously, the model can be constrained to fulfill demographic parity between subgroups defined by gender. We term fairness in this situation independent group fairness.
On the other hand, one can consider all subgroups defined by intersections of sensitive attributes (e.g., ethnicity and gender), leading to intersectional group fairness. A given algorithm can be independently group fair, e.g., when considering race and gender in isolation, but not intersectionally group fair, e.g., when considering intersections of racial and gender groups. For example, Buolamwini and Gebru (2018) showed how facial recognition software had particularly poor performance for black women. This phenomenon, called fairness gerrymandering, has been studied by Kearns et al. (2018). Intersectional fairness is often considered ideal. However, it comes with major statistical and computational hurdles, such as data scarcity at intersections of minority groups and the potentially exponential number of subgroups. Indeed, current algorithms consist of either brute-force enumeration or searching via a cost-sensitive classification problem, and intersectional groups are often empty with finite samples (Kearns et al., 2018). On the other hand, independent group fairness still provides a broad measure of fairness and is much easier to enforce.

We seek to design unifying, statistically consistent strategies for group fairness and to clarify the relationship between the existing definitions.
Our main results and algorithms apply to arbitrary overlapping group definitions. Our contributions are summarized in the following.

• Probabilistic results. We characterize the population-optimal (also known as the Bayes-optimal) prediction procedure for multiclass classification, where all the metrics are general linear functions of the confusion matrix. We consider both overlapping (independent, gerrymandering) and non-overlapping (unrestricted, intersectional) group fairness.

• Algorithms and statistical results. Inspired by the population optimal, we propose simple plugin and weighted empirical risk minimization (ERM) approaches for algorithmically fair classification, and prove their consistency, i.e., the empirical estimator converges to the population optimal with sufficiently large samples. Our general approach recovers existing results for plugin and weighted ERM group-fair classifiers.

• Comparisons. We compare independent group fairness to the overlapping case. We show that intersectional fairness implies overlapping group fairness under weak conditions. However, the converse is not true, i.e., overlapping fairness may not imply intersectional fairness. This result formalizes existing observations on the dangers of gerrymandering.

• Evaluation. Empirical results are provided to highlight our theoretical claims.

Taken together, our results unify and advance the state of the art with respect to the probabilistic, statistical, and algorithmic understanding of group-fair classification. The generality of our approach gives significant flexibility to the algorithm designer when constructing algorithmically-fair learners.
Notation

Throughout the paper, we use uppercase bold letters to represent matrices and lowercase bold letters to represent vectors. Let e_i represent the i-th standard basis vector, whose i-th entry is 1 and all other entries are 0, i.e., e_i = (0, ..., 1, ..., 0). We denote by 1 the all-ones vector, with dimension inferred from context. Given two matrices A, B of the same dimension, ⟨A, B⟩ = Σ_{i,j} a_{ij} b_{ij} is the Frobenius inner product. For any quantity q, q̂ denotes an empirical estimate. Due to limited space, proofs are presented in the appendix. Group notation.
We assume M sensitive attributes, where the m-th attribute is indicated by a group A_m, m ∈ [M]. For example, A_1 may correspond to race, A_2 may correspond to gender, and so on. Combined, the sensitive group indicator is represented by an M-dimensional vector a ∈ A = A_1 × A_2 × ⋯ × A_M. In other words, each instance is associated with M subgroups simultaneously. Probabilistic notation.
Consider the multiclass classification problem where Z denotes the instance space and Y = [K] denotes the output space with K classes. We assume the instances, outputs, and groups are sampled from a probability distribution P over the domain Y × Z × A. A dataset is given by n samples (y^(i), z^(i), a^(i)) ~ P i.i.d., i ∈ [n]. To simplify notation, let X = Z × A, so x = (z, a). Define the set of randomized classifiers H_r = {h : X → Δ_K}, where Δ_q = {p ∈ [0, 1]^q : Σ_{i=1}^q p_i = 1} is the q-dimensional probability simplex. A classifier h is associated with the random variable h ∈ [K] defined by P(h = k | x) = h_k(x). If h is deterministic, then we can write h(x) = e_{h(x)}. Confusion matrices.
For any multiclass classifier, let η(x) ∈ Δ_K denote the class probabilities for a given instance x and sensitive attribute a, whose k-th element is the conditional probability of the output belonging to class k, i.e., η_k(x) = P(Y = k | X = x). The population confusion matrix is C ∈ [0, 1]^{K×K}, with elements defined for k, ℓ ∈ [K] as C_{k,ℓ} = P(Y = k, h = ℓ), or equivalently, C_{k,ℓ} = ∫_x η_k(x) h_ℓ(x) dP(x). Group-specific confusion matrices.
Let G represent a set of subsets of the instances, i.e., potentially overlapping partitions of the instances X. We leave G generic for now, and will specify cases specific to fairness in the following. Given any group g ∈ G, we can define the group-specific confusion matrix C^g ∈ [0, 1]^{K×K}, with elements defined for k, ℓ ∈ [K] as C^g_{k,ℓ} = ∫_x η_k(x) h_ℓ(x) dP(x | x ∈ g).
We will abbreviate the event {x ∈ g} to simply g when it is clear from context. Let π_g = P(X ∈ g) be the probability of group g. It is clear that when the groups G form a partition, i.e., a ∩ b = ∅ for all a, b ∈ G and ∪_{g∈G} g = X, the population confusion may be recovered as a weighted average of group confusions, C = Σ_{g∈G} π_g C^g. Let ω_k = P(Y = k) = Σ_ℓ C_{k,ℓ} be the probability of label k, and ω^g_k = P(Y = k | X ∈ g) = Σ_ℓ C^g_{k,ℓ} be the probability of label k given group g. The sample confusion matrix is defined as Ĉ[h] = (1/n) Σ_{i=1}^n Ĉ^(i)[h], where Ĉ^(i)[h] ∈ [0, 1]^{K×K} and Ĉ^(i)_{k,ℓ}[h] = ⟦y_i = k⟧ h_ℓ(x_i). Here, ⟦·⟧ is the indicator function, so Σ_{k=1}^K Σ_{ℓ=1}^K Ĉ^(i)_{k,ℓ}[h] = 1. The empirical group-specific confusion matrices Ĉ^g are computed by conditioning on groups. In the empirical case, it is convenient to represent group memberships via indices alone, i.e., x_i ∈ g as i ∈ g. We have Ĉ^g[h] = (1/|g|) Σ_{i∈g} Ĉ^(i)[h].
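The sample confusion matrices above translate directly to code. A minimal NumPy sketch (the `group_masks` dictionary of boolean membership vectors is an assumed input format; the groups it encodes may overlap):

```python
import numpy as np

def confusion(y, probs):
    """C_hat[k, l] = (1/n) * sum_i [y_i = k] * h_l(x_i), where probs[i] is the
    randomized classifier's distribution over the K classes for example i."""
    n, K = probs.shape
    C = np.zeros((K, K))
    for k in range(K):
        C[k] = probs[y == k].sum(axis=0) / n
    return C

def group_confusions(y, probs, group_masks):
    """C_hat^g = (1/|g|) * sum_{i in g} C_hat^(i): condition on group membership."""
    return {g: confusion(y[m], probs[m]) for g, m in group_masks.items()}
```

When the groups form a partition, the population confusion is recovered as C = Σ_g π_g Ĉ^g, which makes a convenient sanity check.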
Fairness constraints. Let G_fair represent the (potentially overlapping) set of groups across which we wish to enforce fairness. The following states our formal assumption on G_fair.

Assumption 2.1. G_fair is a function of the sensitive attributes A only.

We will focus the discussion on common cases in the literature. These include non-overlapping (unrestricted, intersectional) and overlapping (independent, gerrymandering) group partitions.

• Unrestricted case.
The simplest case is where the group is defined by a single sensitive attribute (when there are multiple sensitive attributes, all but one are ignored). This has been the primary setting addressed by past literature (Hardt et al., 2016a; Narasimhan, 2018; Agarwal et al., 2018). Thus, for some fixed i ∈ [M], g_j = {(z, a) | a_i = j}, so |G_unrestricted| = |A_i|. In the special case of binary sensitive attributes, |G_unrestricted| = 2.

• Intersectional groups. Here, the non-overlapping groups are associated with all possible combinations of sensitive features. Thus g_a = {(z, a′) | a′ = a} for all a ∈ A, so |G_intersectional| = Π_{m∈[M]} |A_m|. In the special case of binary sensitive attributes, |G_intersectional| = 2^M.

• Independent groups. Here, the groups are overlapping, with a set of groups associated with each fairness attribute separately. It is convenient to denote the groups based on indices representing each attribute and each potential setting. Thus g_{i,j} = {(z, a) | a_i = j}, so |G_independent| = Σ_{m∈[M]} |A_m|. In the special case of binary sensitive attributes, |G_independent| = 2M.

• Gerrymandering intersectional groups. Here, group intersections are defined by any subset of the sensitive attributes, leading to overlapping subgroups: G_gerrymandering = {{(z, a) : a_I = s} : I ⊆ [M], s ∈ A_I}, where a_I denotes a restricted to the entries indexed by I. It is also the closure of G_independent under intersection. As a result, G_intersectional ⊆ G_gerrymandering and G_independent ⊆ G_gerrymandering. In the special case of binary sensitive attributes, |G_gerrymandering| = 3^M.
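These group structures (and the counts above) can be enumerated directly for binary sensitive attributes; a small sketch with hypothetical helper names:

```python
from itertools import combinations, product

def independent_groups(M):
    # one group per (attribute, value) pair: |G_independent| = 2M for binary attributes
    return [(i, v) for i in range(M) for v in (0, 1)]

def intersectional_groups(M):
    # one group per full assignment of all attributes: |G_intersectional| = 2^M
    return list(product((0, 1), repeat=M))

def gerrymandering_groups(M):
    # one group per subset I of attributes plus an assignment s over I: 3^M groups,
    # since each attribute is either fixed to 0, fixed to 1, or left unconstrained
    groups = []
    for r in range(M + 1):
        for I in combinations(range(M), r):
            for s in product((0, 1), repeat=r):
                groups.append(dict(zip(I, s)))
    return groups
```

For M = 3 these yield 6, 8, and 27 groups respectively, matching 2M, 2^M, and 3^M.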
Fairness metrics. We formulate group fairness by upper bounding a fairness violation function V : H → R^J, which can be represented as a linear function of the confusion matrices, i.e., V(h) = Φ(C[h], {C^g[h]}_{g∈G_fair}), where for all j ∈ [J], V(h)_j = φ_j(C[h], {C^g[h]}_{g∈G_fair}) = ⟨U_j, C⟩ − Σ_{g∈G_fair} ⟨V^g_j, C^g⟩. This formulation is sufficiently flexible to include the fairness statistics in common use that we are aware of as special cases. For example, demographic parity for binary classifiers (Dwork et al., 2012) can be defined by fixing C^g_{0,1} + C^g_{1,1} across groups. Equal opportunity (Hardt et al., 2016b) is recovered by fixing the group-specific true positives, using population-specific weights, i.e.,

φ^±_DP = ±(C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν,   φ^±_EO = ±(C^g_{1,1}/ω^g_1 − C_{1,1}/ω_1) − ν,

using both a positive and a negative constraint to penalize both positive and negative deviations between the group and the population, with relaxation ν. Performance metrics.
We consider an error metric E : H → R_+ that is a linear function of the population confusion, E(h) = ψ(C) = ⟨D, C[h]⟩. This setting has been studied in binary classification (Yan et al., 2018), multiclass classification (Narasimhan et al., 2015), multilabel classification (Koyejo et al., 2015), and multioutput classification (Wang et al., 2019). For instance, standard classification error corresponds to setting D = 11^⊤ − I. The goal is to learn the Bayes-optimal classifier with respect to the given metric, which, when it exists, is given by:

h* ∈ argmin_h E(h) s.t. V(h) ≤ 0.   (1)

We denote the optimal error as E* = E(h*). We say a classifier h_n constructed using finite data of size n is {E, V}-consistent if E(h_n) →_P E* and V(h_n) →_P 0 as n → ∞. We also consider the empirical versions of the error, Ê(h) = ψ(Ĉ[h]), and of the fairness violation, V̂(h) = Φ(Ĉ[h], {Ĉ^g[h]}_g).
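Both the error metric and the fairness violations are plain Frobenius inner products with (group) confusion matrices. A binary-classification sketch, assuming labels {0, 1} with column 1 holding the positive predictions:

```python
import numpy as np

def linear_error(C, D):
    # E(h) = <D, C>; with D = 1 1^T - I this is the standard 0-1 error
    return float((D * C).sum())

def dp_violation(C, C_g, nu=0.0):
    """max of phi_DP^+ and phi_DP^-: the absolute gap between the group's and the
    population's predicted-positive rate, minus the relaxation nu."""
    gap = (C_g[0, 1] + C_g[1, 1]) - (C[0, 1] + C[1, 1])
    return abs(gap) - nu

K = 2
D_01 = np.ones((K, K)) - np.eye(K)  # off-diagonal mass = misclassification rate
```

The same pattern extends to any of the linear metrics in Table 1 by swapping the matrices D, U_j, and V_j^g.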
Table 1: Examples of multiclass performance metrics and fairness metrics studied in this manuscript.

  Metric         ψ(C)                                        Fairness Metric         φ(C, {C^g}_g)
  Weighted Acc.  Σ_{i=1}^K Σ_{j=1}^K b_{i,j} C_{i,j}         Demographic Parity      (C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν
  Ordinal Acc.   Σ_{i=1}^K Σ_{j=1}^K (1 − |i−j|/(K−1)) C_{i,j}   Equalized Opportunity   (C^g_{1,1}/ω^g_1 − C_{1,1}/ω_1) − ν

Bayes-Optimal Group-Fair Classifiers

In this section, we identify a parametric form for the Bayes-optimal group-fair classifier under standard assumptions. To begin, we introduce the following general assumption on the joint distribution.
Assumption 3.1 (η-continuity). Assume P({η(x) = c}) = 0 for all c ∈ Δ_K. Furthermore, let Q = η(x) be a random variable with density p_η(Q), where p_η(Q) is absolutely continuous with respect to the Lebesgue measure restricted to Δ_K.

This assumption imposes that the conditional probability, as a random variable, has a well-defined density. Analogous regularity assumptions are widely employed in the literature on designing well-defined complex classification metrics and seem to be unavoidable (we refer the interested reader to Yan et al. (2018); Narasimhan et al. (2015) for details). Next, we define the general form of weighted multiclass classifiers, which are the Bayes-optimal classifiers for linear metrics.

Definition 3.2 (Narasimhan et al. (2015)). Given a loss matrix W ∈ R^{K×K}, a weighted classifier h satisfies h_i(x) > 0 only if i ∈ argmin_{k∈[K]} ⟨W_k, η(x)⟩.

Next, we present our first main result, identifying the Bayes-optimal group-fair classifier. Theorem 3.1.
Under Assumption 2.1 and Assumption 3.1, if (1) is feasible (i.e., a solution exists), the Bayes-optimal classifier is given by h*(x) = h*(z, a) = β_a h_1(x) + (1 − β_a) h_2(x), where β_a ∈ (0, 1) for all a ∈ A and h_1, h_2 are weighted classifiers with weights {{W_{i,a}}_{i∈{1,2}}}_{a∈A}.

One key observation is that, pointwise, the Bayes-optimal classifier can be decomposed based on the intersectional groups G_intersectional = A, even when G_fair is overlapping. This observation will prove useful for algorithms.

Recent research (Kearns et al., 2018) has shown, primarily via examples, how imposing overlapping group fairness using independent fairness restrictions can lead to violations of intersectional fairness. This observation led to the term fairness gerrymandering. Here, we examine this claim more formally, showing that enforcing intersectional fairness controls overlapping fairness, although the converse is not always true, i.e., enforcing overlapping fairness does not imply intersectional fairness. We show this result for the general case of quasi-convex fairness measures, with linear fairness metrics recovered as a special case.
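Theorem 3.1's form of the Bayes-optimal classifier is easy to instantiate once the per-group weight matrices and mixing coefficients are known; in the sketch below they are hypothetical inputs (in practice they would come from the dual solution):

```python
import numpy as np

def weighted_predict(eta_x, W):
    # a weighted classifier (Definition 3.2): predict argmin_k <W_k, eta(x)>
    return int(np.argmin(eta_x @ W))

def bayes_optimal_predict(eta_x, a, beta, W1, W2, rng):
    """h*(x) = beta_a h_1(x) + (1 - beta_a) h_2(x): with probability beta_a,
    follow group a's first weighted classifier, otherwise its second."""
    W = W1[a] if rng.random() < beta[a] else W2[a]
    return weighted_predict(eta_x, W)
```

With W = 11^⊤ − I both branches reduce to argmax_k η_k(x), the unconstrained Bayes classifier, so the fairness constraint enters purely through the weight matrices and β_a.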
Proposition 3.2. For any G_fair that satisfies Assumption 2.1, suppose φ : [0, 1]^{K×K} × [0, 1]^{K×K} → R_+ is quasiconvex. Then φ(C, C^g) ≤ 0 for all g ∈ G_intersectional implies φ(C, C^g) ≤ 0 for all g ∈ G_fair. The converse does not hold.
Remark 3.3. Note that the converse claim of Proposition 3.2 does not apply to G_gerrymandering. Controlling the gerrymandering fairness violation implies control of the intersectional fairness violation, since G_intersectional ⊆ G_gerrymandering.

Algorithms

Here we present
GroupFair, a general empirical procedure for solving (1). The Lagrangian of the constrained optimization problem (1) is L(h, λ) = E(h) + λ^⊤ V(h), with empirical Lagrangian L̂(h, λ) = Ê(h) + λ^⊤(V̂(h) − ε), where ε is a buffer for generalization. Our approach involves finding a saddle point of the Lagrangian. The returned classifiers are probabilistic combinations of classifiers in H, i.e., the procedure returns a classifier in conv(H). In the following, we first assume the dual parameter λ is fixed, and describe the primal solution as a classification oracle. We consider both plugin and weighted ERM oracles. In brief, the plugin estimator first proceeds assuming η(x) is known, then plugs in the empirical estimator η̂(x) in its place. The plugin approach has the benefit of low computational complexity once η̂ is fixed. On the other hand, the weighted ERM estimator requires the solution of a weighted classification problem in each round, but avoids the need for estimating η̂(x).
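Evaluating the empirical Lagrangian is then a one-liner; here `err` and `viol` stand for the scalar ψ(Ĉ[h]) and the J-vector Φ(Ĉ[h], {Ĉ^g[h]}) computed from the sample confusions:

```python
import numpy as np

def empirical_lagrangian(err, viol, lam, eps):
    # L_hat(h, lam) = E_hat(h) + lam^T (V_hat(h) - eps)
    return float(err + lam @ (viol - eps))
```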
Algorithm 1: GroupFair, group-fair classification with overlapping groups.

Input: ψ : [0, 1]^{K×K} → [0, 1], Φ : [0, 1]^{K×K} × ([0, 1]^{K×K})^{G_fair} → [0, 1]^J, samples {(x_1, y_1), ..., (x_n, y_n)}.
Initialize λ_1 ∈ [0, B]^J.
for t = 1, ..., T do
    h_t ← MinOracle_{h∈H}(L(h, λ_t), z^n)
    λ_{t+1} ← Update_t(λ_t, Φ(Ĉ[h_t], {Ĉ^g[h_t]}_{g∈G_fair}) − ε)
end
h̄_T ← (1/T) Σ_{t=1}^T h_t, λ̄_T ← (1/T) Σ_{t=1}^T λ_t
return (h̄_T, λ̄_T)

In the weighted ERM approach, we parametrize h : X → [K] by a class F of functions f : X → R^K. The classification is the argmax of the predicted vector, h(x) = argmax_j f(x)_j, so we denote the set of classifiers as H_werm = argmax ∘ F. The following special case of Definition 1 in Ramaswamy and Agarwal (2016) outlines the required conditions for weighted multiclass classification calibration. This is commonly referred to as cost-sensitive classification (Agarwal et al., 2018) when applied to binary classification.

Definition 4.1 (W-calibration (Ramaswamy and Agarwal, 2016)). Let W ∈ R_+^{K×K}. A surrogate function L : R^K → R_+^K is said to be W-calibrated if for all p ∈ Δ_K: inf_{u : argmax(u) ∉ argmin_k (p^⊤ W)_k} p^⊤ L(u) > inf_u p^⊤ L(u).

Note that the weights are sample (group) specific, which, while uncommon, is not new, e.g., Ávila Pires et al. (2013).
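The outer loop of Algorithm 1 can be sketched as follows. `min_oracle` stands in for either the plugin or the weighted ERM oracle and is assumed to return a classifier together with its empirical fairness violations; `update` is the dual step:

```python
import numpy as np

def group_fair(min_oracle, update, J, T, B=1.0, eps=0.0):
    """Primal-dual loop of Algorithm 1 (GroupFair). Returns the list of oracle
    classifiers (the averaged classifier h_bar is their uniform mixture) and the
    averaged dual variable lambda_bar."""
    lam = np.zeros(J)
    classifiers, duals = [], []
    for t in range(T):
        h_t, viol = min_oracle(lam)                      # primal step via the oracle
        lam = np.clip(update(lam, viol - eps), 0.0, B)   # dual step, kept in [0, B]^J
        classifiers.append(h_t)
        duals.append(lam.copy())
    return classifiers, np.mean(duals, axis=0)
```

With `update = lambda lam, v: lam + eta_t * v` this matches the projected-gradient (FairCOCO-style) dual step of Table 2; an exponentiated-gradient update gives the FairReduction-style step.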
Proposition 4.1. The weighted ERM estimator for the average fairness violation is given by h(x) = argmax_j f*(x)_j, f* = argmin_{f∈F} L̂(f), where L̂(f) = Ê[y^⊤ L(f)] is a multiclass classification surrogate for the weighted multiclass error with group-dependent weights, for all a ∈ A,

W(x) = D + Σ_{j=1}^J λ_j ( U_j − Σ_{g∈G_fair : a∈g} V^g_j / π̂(g) ).   (2)

The plugin hypothesis class consists of the weighted classifiers identified by Theorem 3.1, H_plg = {h(x) = argmin_{j∈[K]} (η̂(x)^⊤ B(x))_j : B(x) ∈ R^{K×K}}. Here, we focus on the average violation case only. By simply reordering terms, the population problem can be determined as follows. Proposition 4.2.
The plug-in estimator for the average fairness violation is given by ĥ(x) = argmin_{k∈[K]} (η̂(x)^⊤ W(x))_k, where W(x) is defined in (2).

GroupFair, a General Group-Fair Classification Algorithm
We can now present
GroupFair, a general algorithm for group-fair classification with overlapping groups, as outlined in Algorithm 1. Our approach proceeds in rounds, alternating between the classifier oracle and the dual variable: interleaved with the primal update is a dual update Update_t(λ, v) via gradient ascent on the dual variable. The resulting classifier is the average over the oracle classifiers. Recovery of existing methods.
When the groups are non-overlapping, GroupFair with the plugin oracle and projected gradient ascent update recovers FairCOCO (Narasimhan, 2018). Similarly, when the groups are non-overlapping and the labels are binary, GroupFair with the weighted ERM oracle and exponentiated gradient update recovers FairReduction (Agarwal et al., 2018) (see also Table 2). Importantly, GroupFair enables a straightforward extension to overlapping groups.
Table 2: Oracles and dual updates that recover existing methods. The oracles shown are plugin (6) and ERM on the reweighted L̂ (7). H = [argmax_{k∈[K]} (·)_k] converts a function X → R^K to a classifier. In FairCOCO, η̂ is estimated from samples z^{n/2} = {(x_1, y_1), ..., (x_{n/2}, y_{n/2})}, and all of the other probability estimates (π̂_g)_g and {Ĉ^g[h_t]}_g are estimated from z^n \ z^{n/2}.

  Method         MinOracle_{h∈H}(L(h, λ_t), z^n)            Update_t(λ, v)
  FairReduction  H ∘ argmin_{f∈F} L̂(f)                     λ_i ← B exp(log λ_i + η_t v_i) / (B − Σ_{j=1}^M λ_j + Σ_{j=1}^M exp(log λ_j + η_t v_j))
  FairCOCO       plugin(η̂, (π̂_g)_{g∈G_fair}, ψ, Φ, λ_t)   proj_{[0,B]^M}(λ + η_t v)

Consistency

Here we discuss the consistency of the weighted ERM and the plugin approaches. For any class H = {h : X → [K]}, denote H_k = {𝟙{h(x) = k} : h ∈ H}. We assume WLOG that VC(H_1) = ... = VC(H_K) and denote this quantity as VC(H). Next, we give a theorem relating the performance and constraint satisfaction of an empirical saddle point to an optimal fair classifier. Theorem 5.1.
Suppose ψ : [0, 1]^{K×K} → [0, 1] and Φ : [0, 1]^{K×K} × ([0, 1]^{K×K})^{G_fair} → [0, 1]^J are ρ-Lipschitz w.r.t. ‖·‖_∞. Recall L̂(h, λ) = Ê(h) + λ^⊤(V̂(h) − ε). Define γ(n′, H, δ) = √((VC(H) log(n′) + log(1/δ)) / n′). If n_min = min_{g∈G_fair} n_g and ε = Ω(ρ γ(n_min, H, δ)), then with probability 1 − δ: if (h̄, λ̄) is a ν-saddle point of max_{λ∈[0,B]^J} min_{h∈conv(H)} L̂(h, λ), in the sense that max_{λ∈[0,B]^J} L̂(h̄, λ) − min_{h∈conv(H)} L̂(h, λ̄) ≤ ν, and h* ∈ conv(H) satisfies V(h*) ≤ 0, then

E(h̄) ≤ E(h*) + ν + O(ρ γ(n, H, δ)),   ‖V(h̄)‖_∞ ≤ ν/B + O(ρ γ(n_min, H, δ)) + ε.

Thus, as long as we can find an arbitrarily good saddle point, which weighted ERM grants if H_werm is expressive enough while having finite VC dimension, we obtain consistency. A saddle point can be found by running a gradient ascent algorithm on λ confined to [0, B]^J, which repeatedly computes h_t = argmin_{h∈H} L̂(h, λ_t); the final (h̄, λ̄) are the averages of the primal and dual variables computed throughout the algorithm.

Although Theorem 5.1 captures the spirit of the argument for the plugin algorithm, it only applies naturally to the weighted ERM algorithm. This is because the plugin algorithm solves a subtly different minimization problem: it returns h_t as the population minimizer obtained when the estimated regression function η̂ replaces the true regression function. Theorem 5.2.
With probability at least 1 − δ, if projected gradient ascent is run as Update_t(λ, v) = proj_{[0,B]^J}(λ + η v) for T iterations with step size η = B/√T, and for t = 1, ..., T, h_t = plugin(η̂, (π̂_g)_{g∈G_fair}, ψ, Φ), then letting ρ = max{‖ψ‖, ‖φ_1‖, ..., ‖φ_M‖}, ρ_g = Σ_{j=1}^J ‖V^g_j‖_∞, ρ_X = ‖D‖_∞ + Σ_{j=1}^J ‖U_j‖_∞, Δ_η = E‖η(x) − η̂(x)‖_1, and ň = min_{g∈G_fair} n_g, with

κ := O( Jρ [ √((K log(ň) + log(|G_fair| K / δ)) / ň) + Δ_η ] ( ρ_X + Σ_{g∈G_fair} ρ_g/π_g ) + √(log(|G_fair|/δ)/n) Σ_{g∈G_fair} ρ_g/π_g ),

it holds that E_ψ(h̄_T) ≤ E*_ψ + JB/√T + O(BJκ) and ‖V_φ(h̄_T)‖_∞ ≤ J/√T + O(Jκ).

A key point in the presented analyses (for both procedures) is that the dominating statistical properties depend on the number of fairness groups. We note that |G_fair| ≪ |G_intersectional| = |A| for the independent case, so this significantly improves results. More broadly, we conjecture that the statistical bounds depend on min(|G_fair|, |G_intersectional|), and leave the details to future work. We also note the statistical dependence on the size of the smallest group. This seems to be unavoidable, as we need an estimate of the group fairness violation in order to control it. To this end, group violations may be scaled by group size, which leads instead to a dependence on the VC dimension of G_fair, improving the statistical dependence with small groups at the cost of some fairness (Kearns et al., 2018). We expect that the bounds may be improved by a more refined analysis, or by modified algorithms with stronger assumptions. We leave this detail to future work.
Table 3: Average training times in seconds (averaged over the training sessions for each fairness parameter). The Plugin oracle is significantly faster than the other approaches.

                 Independent                           Gerrymandering
                 C&C    Adult   German  Law school    Adult    German  Law school
  Weighted-ERM
  Plugin
  Regularizer
  Kearns et al.  N/A    N/A     N/A     N/A           2213.7   821.5   1674.4
Related Work

Recent work by Foulds et al. (2018), Kearns et al. (2018), and Hebert-Johnson et al. (2018) were among the first to define and study intersectional fairness, with respect to parity and calibration metrics respectively. Narasimhan (2018) provides a plugin algorithm for group fairness and generalization guarantees for the unrestricted case. Menon and Williamson (2018) considered Bayes optimality of fair binary classification where the sensitive attribute is unknown at test time, using an additional sensitive attribute regressor. Cotter et al. (2018) provide a proxy-Lagrangian algorithm with generalization guarantees, assuming proxy constraint functions which are strongly convex, and argue that better generalization is achieved by reserving part of the dataset for training primal parameters and part for training dual parameters. Celis et al. (2018) provide an algorithm with generalization guarantees for independent group fairness based on solving a grid of interval-constrained programs; their work and Narasimhan (2018)'s are most similar to ours.
Experiments

We consider demographic parity as the fairness violation, i.e., φ^±_DP = ±(C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν, combined with the 0-1 error ψ(C) = C_{0,1} + C_{1,0} as the error metric. All labels and protected attributes are binary or binarized. We use the following datasets (details in the appendix): (i) Communities and Crime, (ii) Adult census, (iii) German credit, and (iv) Law school. Evaluation Metric.
We compute the "fairness frontier" of each method; that is, we vary the constraint level ν and plot the fairness violation and the error rate on the training set and a test set. The fairness violation for demographic parity is defined by fairviol_DP = max_{g∈G_fair} |Ĉ^g_{0,1} + Ĉ^g_{1,1} − Ĉ_{0,1} − Ĉ_{1,1}|. Observe that on the training set, it is always possible to achieve extreme points by ignoring either the classification error or the fairness violation.
Baseline:
Regularizer is a linear classifier implemented by using Adam to minimize the logistic loss plus the following regularization function:

ρ Σ_{j=1}^M ( Σ_{i : (z_i)_j = 1} σ(w^⊤ x_i) / |{i : (z_i)_j = 1}| − Σ_{i=1}^n σ(w^⊤ x_i) / n )²,   (3)

where σ(r) = 1/(1 + e^{−r}) is the sigmoid function. This penalizes the squared differences between the average prediction probability for each group and the overall average prediction probability. Other existing methods we are aware of are either not applicable to overlapping groups or are special cases of GroupFair.
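The baseline penalty (3) is straightforward to implement; a NumPy sketch for binary group indicators Z (n × M), with w the linear model's weights:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def dp_regularizer(w, X, Z, rho=1.0):
    """Eq. (3): rho * sum_j (group-j mean of sigma(w^T x) - overall mean)^2.
    Groups with no members are skipped to avoid dividing by zero."""
    p = sigmoid(X @ w)          # per-example predicted positive probability
    overall = p.mean()
    penalty = 0.0
    for j in range(Z.shape[1]):
        mask = Z[:, j] == 1
        if mask.any():
            penalty += (p[mask].mean() - overall) ** 2
    return rho * penalty
```

In training, this term is added to the logistic loss and both are minimized jointly over w.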
Experiment 1: Independent group fairness. We consider independent group fairness, defined by considering protected attributes separately. Our results compare extensions of FairCOCO (Narasimhan, 2018) and FairReduction (Agarwal et al., 2018), existing special cases of GroupFair using the plugin and weighted ERM oracles respectively. Results are shown in Figure 1. We further present the differences in training time in Table 3. On all datasets, the variants of GroupFair are much more effective than a generic regularization approach. However, Plugin seems to violate fairness more often at test time; perhaps this is due to the ‖η̂ − η‖ term in the generalization bound in Theorem 5.2. At the same time, Plugin is almost two orders of magnitude faster, since its MinOracle essentially has a closed-form solution, while Weighted-ERM has to solve a new ERM problem in each iteration.
Experiment 2: Gerrymandering group fairness.
[Figure 1: Experiments on independent group fairness, showing the fairness frontier. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.]

[Figure 2: Experiments on gerrymandering group fairness. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.]

Unfortunately, intersectional fairness is not statistically estimable in most cases, as most intersections are empty. As a remedy, Kearns et al. (2018) propose max-violation fairness constraints over G_gerrymandering, where each group is weighted by group size, i.e., max_{g∈G_gerrymandering} (|g|/n) |Ĉ^g_{0,1} + Ĉ^g_{1,1} − Ĉ_{0,1} − Ĉ_{1,1}|, so empty groups are removed and small groups have relatively low influence unless there is a very large fairness violation. We denote the approach of Kearns et al. (2018) as Kearns et al.
This approach is closely related to Weighted-ERM, but it searches for the maximally violated group by solving a cost-sensitive classification problem and uses fictitious play between λ and h. For the Plugin and Weighted-ERM approaches, we optimize the cost function directly using gradient ascent, precomputing the gerrymandering groups present in the data. Results are shown in Figure 2. We further present the differences in training time in Table 3. The results are roughly equivalent in terms of performance; however, both the Weighted-ERM and Plugin approaches are one to two orders of magnitude faster than Kearns et al.
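The size-weighted constraint used in the gerrymandering experiments can be sketched as follows (group confusion matrices and sizes as in the earlier notation; inputs hypothetical):

```python
import numpy as np

def weighted_max_dp_violation(C, group_confusions, group_sizes, n):
    """max_g (|g|/n) * |(C^g[0,1] + C^g[1,1]) - (C[0,1] + C[1,1])|: empty groups
    contribute nothing, and small groups are down-weighted unless their
    demographic-parity gap is very large."""
    pop_rate = C[0, 1] + C[1, 1]
    worst = 0.0
    for g, C_g in group_confusions.items():
        gap = abs((C_g[0, 1] + C_g[1, 1]) - pop_rate)
        worst = max(worst, group_sizes[g] / n * gap)
    return worst
```

Note how a tiny group with a large gap can still be dominated by a large group with a moderate gap, which is exactly the intended down-weighting behavior.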
Conclusion

This manuscript considered algorithmic fairness across multiple overlapping groups simultaneously. Using a probabilistic population analysis, we present the Bayes-optimal classifier, which motivates a general-purpose algorithm, GroupFair. Our approach unifies a variety of existing group-fair classification methods and enables extensions to a wide range of non-decomposable multiclass performance metrics and fairness measures. Future work will include extensions beyond linear metrics, to consider more general fractional and convex metrics. We also wish to explore more complex prediction settings beyond classification.
References
Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69, 2018.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of some recent advances. ESAIM: PS, 9:323-375, 2005. doi: 10.1051/ps:2005018. URL https://doi.org/10.1051/ps:2005018.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77-91, New York, NY, USA, 23-24 Feb 2018. PMLR. URL http://proceedings.mlr.press/v81/buolamwini18a.html.
A P
REPRINT
Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification.
Data Min. Knowl.Discov. , 21:277–292, 09 2010. doi: 10.1007/s10618-010-0190-x.L. Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. Classification with Fairness Constraints: AMeta-Algorithm with Provable Guarantees. arXiv e-prints , art. arXiv:1806.06055, June 2018.Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv e-prints , art. arXiv:1703.00056, Feb 2017.Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, andSeungil You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints. arXiv e-prints , art. arXiv:1807.00028, June 2018.Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml .Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness.In
Proceedings of the 3rd Innovations in Theoretical Computer Science Conference , ITCS ’12, pages 214–226,New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1115-1. doi: 10.1145/2090236.2090255. URL http://doi.acm.org/10.1145/2090236.2090255 .James Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. An Intersectional Definition of Fairness. arXive-prints , art. arXiv:1807.08362, Jul 2018.Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In
Proceedings of the30th International Conference on Neural Information Processing Systems , NIPS’16, pages 3323–3331, USA, 2016a.Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157469 .Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In
Proceedings of the30th International Conference on Neural Information Processing Systems , NIPS’16, pages 3323–3331, USA, 2016b.Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157469 .Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the(Computationally-identifiable) masses. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35thInternational Conference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research , pages1939–1948, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/hebert-johnson18a.html .Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach.pages 643–650, 12 2011. doi: 10.1109/ICDMW.2011.83.Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing andlearning for subgroup fairness. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35th InternationalConference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research , pages 2564–2572,Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kearns18a.html .Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent multilabelclassification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in NeuralInformation Processing Systems 28 , pages 3321–3329. Curran Associates, Inc., 2015.Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian,Trevor Back, Mary Chesus, Greg C Corrado, Ara Darzi, et al. International evaluation of an ai system for breastcancer screening.
Nature , 577(7788):89–94, 2020.Albert Meijer and Martijn Wessels. Predictive policing: Review of benefits and drawbacks.
International Journal ofPublic Administration , 42(12):1031–1039, 2019.Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Sorelle A. Friedler andChristo Wilson, editors,
Proceedings of the 1st Conference on Fairness, Accountability and Transparency , volume 81of
Proceedings of Machine Learning Research , pages 107–118, New York, NY, USA, 23–24 Feb 2018. PMLR. URL http://proceedings.mlr.press/v81/menon18a.html .Harikrishna Narasimhan. Learning with complex loss functions and constraints. In Amos Storkey and FernandoPerez-Cruz, editors,
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics ,volume 84 of
Proceedings of Machine Learning Research , pages 1646–1654, Playa Blanca, Lanzarote, CanaryIslands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/narasimhan18a.html .9airness with Overlapping Groups
A P
REPRINT
Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithmsfor complex performance measures. In
Proceedings of the 32nd International Conference on Machine Learning(ICML-15) , pages 2398–2407, 2015.Harish G Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices.
The Journal ofMachine Learning Research , 17(1):397–441, 2016.Chih-Fong Tsai and Jhen-Wei Wu. Using neural network ensembles for bankruptcy prediction and credit scoring.
Expert systems with applications , 34(4):2639–2649, 2008.Xiaoyan Wang, Ran Li, Bowei Yan, and Oluwasanmi Koyejo. Consistent classification with generalized metrics. arXivpreprint arXiv:1908.09057 , 2019.Bowei Yan, Sanmi Koyejo, Kai Zhong, and Pradeep Ravikumar. Binary classification with karmic, threshold-quasi-concave metrics. In
Proceedings of the 35th International Conference on Machine Learning , volume 80, pages5531–5540. PMLR, 2018.Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In SanjoyDasgupta and David McAllester, editors,
Proceedings of the 30th International Conference on Machine Learning ,volume 28 of
Proceedings of Machine Learning Research , pages 325–333, Atlanta, Georgia, USA, 17–19 Jun 2013.PMLR. URL http://proceedings.mlr.press/v28/zemel13.html .Bernardo Ávila Pires, Csaba Szepesvari, and Mohammad Ghavamzadeh. Cost-sensitive multiclass classification riskbounds. In Sanjoy Dasgupta and David McAllester, editors,
Proceedings of the 30th International Conference onMachine Learning , volume 28 of
Proceedings of Machine Learning Research , pages 1391–1399, Atlanta, Georgia,USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/avilapires13.html .10airness with Overlapping Groups
A P
REPRINT
Appendix

A Bayes optimal
Theorem 3.1.
Under Assumption 2.1 and Assumption 3.1, if (1), i.e., $h^* \in \operatorname{argmin}_h \mathcal{E}(h)$ s.t. $\mathcal{V}(h) \le 0$, is feasible (i.e., a solution exists), the Bayes-optimal classifier is given by
$$h^*(x) = h^*(z, a) = \beta_a h_1(x) + (1 - \beta_a) h_2(x),$$
where $\beta_a \in (0, 1)\ \forall a \in \mathcal{A}$ and the $h_i(x)$ are weighted classifiers with weights $\{\{W_{i,a}\}_{i \in \{1,2\}}\}_{a \in \mathcal{A}}$.

Proof.
The key idea of the proof is to exploit the problem representation in terms of confusion matrices. The proof has two main steps: (i) a population analysis of feasible confusion matrices, and (ii) plug-in of the classifiers that achieve the Bayes-optimal confusion.
Confusion space.
As the first step, let $\mathcal{C}_g = \{C^g(h) \mid h \in \mathcal{H}\}$ be the set of all group-$g$ specific confusion matrices, and let $\mathcal{C}_{G_{\mathrm{fair}}} = \prod_{g \in G_{\mathrm{fair}}} \mathcal{C}_g$ be the product space of all confusion matrices corresponding to fair groups associated with a given instance of the problem. Similarly, let $\mathcal{C}_{\mathcal{A}} = \prod_{g \in G_{\mathrm{intersectional}}} \mathcal{C}_g$ be the product space of all confusion matrices corresponding to intersectional groups. A standard property of confusion matrices is that each $\mathcal{C}_g$ is a convex set (Narasimhan et al., 2015; Narasimhan, 2018; Wang et al., 2019). Thus, each $C \in \mathcal{C}_g$ can be described as a mixture of two boundary points:
$$\forall C \in \mathcal{C}_g\ \ \exists C_1, C_2 \in \partial\mathcal{C}_g,\ \beta \in [0,1],\ \text{s.t.}\ C = \beta C_1 + (1 - \beta) C_2.$$
Another useful fact is that all confusion matrices on the boundary can be achieved by a weighted classifier (Narasimhan et al., 2015; Narasimhan, 2018; Wang et al., 2019). This fact follows from the convexity of the set $\mathcal{C}_g$, and is simply a dual representation via support functions:
$$\forall C \in \partial\mathcal{C}_g\ \ \exists W\ \text{s.t.}\ C = \mathrm{Conf}_g(h^*), \quad \text{where } h^* \in \operatorname*{argmax}_{h \in \mathcal{H}} \langle W, \mathrm{Conf}_g(h)\rangle,$$
and where, for notational clarity, $\mathrm{Conf}(h)$ is the confusion matrix of classifier $h$, and $\mathrm{Conf}_g(h)$ is the group-restricted confusion matrix. Further, the solution $h^*$ can be represented as a weighted classifier (Definition 3.2) (Narasimhan, 2018; Wang et al., 2019).

Population confusion problem.
Recall that the population confusion can be decomposed into its intersectional counterparts, $C = \sum_{a \in G_{\mathrm{intersectional}}} P(a)\, C^a$. Similarly, each overlapping group confusion can be decomposed using the intersectional confusions: for $C^g \in \mathcal{C}_{G_{\mathrm{fair}}}$, $C^g = \sum_{a \in G_{\mathrm{intersectional}}} P(a \mid g)\, C^a$.

As the overall metric is a function of confusion matrices only, we can re-state (1) as the equivalent confusion problem (with slight abuse of notation) for any $G_{\mathrm{fair}}$:
$$C^*, \{C^{g,*}\} = \operatorname{argmin}\ \psi(C)\ \ \text{s.t.}\ \ \Phi(C, \{C^g\}) \le 0, \quad C = \sum_{a \in G_{\mathrm{intersectional}}} P(a)\, C^a, \quad C^g = \sum_{a \in G_{\mathrm{intersectional}}} P(a \mid g)\, C^a, \quad C^a = \mathrm{Conf}_a(h).$$
After substituting the population confusion $C$ and the group confusions $C^g$ with the presented linear functions of the $C^a$, this is equivalent to the problem
$$\{C^{a,*}\} = \operatorname{argmin}\ \psi(\{C^a\})\ \ \text{s.t.}\ \ \Phi(\{C^a\}) \le 0, \quad C^a = \mathrm{Conf}_a(h).$$
Here, we have used the linearity of the cost functions $\psi$ and $\Phi$, and the linearity of the confusion matrix decompositions into intersectional confusion matrices.

Putting it together.
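The intersectional decomposition of the population confusion matrix used in this step can be checked numerically. Below is a minimal sketch with synthetic labels, predictions, and intersectional group assignments (all hypothetical): it verifies that the population confusion equals the probability-weighted sum of the per-group confusions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, n_groups = 1000, 3, 4
a = rng.integers(0, n_groups, size=n)   # intersectional group of each sample
y = rng.integers(0, K, size=n)          # true labels
h = rng.integers(0, K, size=n)          # predicted labels

def conf(mask):
    """Empirical confusion matrix restricted to the samples in `mask`."""
    C = np.zeros((K, K))
    for i, j in zip(y[mask], h[mask]):
        C[i, j] += 1
    return C / mask.sum()

C_pop = conf(np.ones(n, dtype=bool))
# C = sum_a P(a) C^a, with P(a) and C^a both estimated from the sample
C_mix = sum((a == g).mean() * conf(a == g) for g in range(n_groups))
print(np.allclose(C_pop, C_mix))  # True
```

The identity is exact on any sample because the group weights and group confusions use the same empirical counts; the same bookkeeping gives $C^g = \sum_a P(a \mid g) C^a$ for overlapping groups.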
The final step is noting that a solution, if it exists, can be represented by feasible intersectional confusion matrices $\{C^{a,*}\}$, and in turn, each intersectional confusion matrix can be recovered as a weighted average of two intersectional boundary confusion matrices. Thus the corresponding classifiers can be recovered as a mixture of two weighted classifiers.
B Independent vs. intersectional group fairness
Proposition 3.2.
For any $G_{\mathrm{fair}}$ that satisfies Assumption 2.1, suppose $\varphi : [0,1]^{K \times K} \times [0,1]^{K \times K} \to \mathbb{R}_+$ is quasiconvex in its second argument. Then
$$\varphi(C, C^g) \le \nu\ \ \forall g \in G_{\mathrm{intersectional}} \implies \varphi(C, C^g) \le \nu\ \ \forall g \in G_{\mathrm{fair}}.$$
The converse does not hold.
Proof. (For the forward direction.) Recall that $f$ is quasiconvex if $f(\sum_i \lambda_i z_i) \le \max_i \{f(z_i)\}$ for any convex combination. When $\varphi$ is quasiconvex in its second argument, for any $G_{\mathrm{fair}}$ we can compute
$$\varphi(C, C^g) = \varphi\Big(C, \sum_{a \in G_{\mathrm{intersectional}}} \lambda_a C^a\Big) \le \max_{a \in G_{\mathrm{intersectional}}} \varphi(C, C^a),$$
where the $\lambda_a$ are convex weights (corresponding to inclusion probabilities). Since $\varphi(C, C^a) \le \nu$ by the claim, it follows that $\varphi(C, C^a) \le \nu\ \forall a \in G_{\mathrm{intersectional}} \implies \varphi(C, C^g) \le \nu\ \forall g \in G_{\mathrm{fair}}$.

Converse.
Though the above applies to any quasiconvex metric, in this manuscript we mainly consider linear metrics. As a corollary, intersectional group fairness with respect to common fairness metrics such as demographic parity or equal opportunity implies independent group fairness. A simple xor-like example from Kearns et al. (2018) shows that the converse is not true.

We provide another counterexample to the converse, showing a gap between independent and intersectional demographic parity (DP) group fairness, on an example with more realistic structure.
Example B.1.
Let $A_1, A_2, A_3$ be binary attributes and let $\{A_m\}$ denote the event $\{A_m = 1\}$. If $P(Y) = P(A_1) = P(A_2) = P(A_3) = 0.5$; $A_1, A_2, A_3$ are both independent and conditionally independent given $Y$; and $P(A_m \mid Y) = 0.6$, then for every $P, N \subset \{1, 2, 3\}$ with $P \cap N = \emptyset$,
$$P(Y \mid \cap_{i \in P} A_i,\ \cap_{j \in N} \bar A_j) = 0.5 \cdot (1.2)^{|P|} \cdot (0.8)^{|N|}.$$

Proposition B.1.
An optimal (DP) intersectionally fair $\hat Y$ has, over every possible subgroup $G = \cap_{i \in P} A_i \cap_{j \in N} \bar A_j$, $P(\hat Y \mid G) = 0.384 = 0.5 \cdot 1.2 \cdot (0.8)^2$, and has an error of $0.148$. On the other hand, an optimal (DP) independently fair classifier has $P(\hat Y \mid A_1, A_2, A_3) = 0.464$, $P(\hat Y \mid A_i, A_j, \bar A_k) = 0.576$, $P(\hat Y \mid A_i, \bar A_j, \bar A_k) = 0.384$, and $P(\hat Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.656$, and has an error of $0.1$. Interestingly, even though $P(Y \mid A_1, A_2, A_3) = 0.864$ and $P(Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.256$ are the highest and lowest conditional probabilities, the reverse is true of the predictor $\hat Y$: it sacrifices accuracy on these groups to obtain higher accuracy on mixed positive/complement intersections.

Here we set up and discuss the example in Section 3.2 in more detail. First we begin with a rigorous and more general description of the structure of the example; here, one can think of a binary attribute as synonymous with a two-block partition: the first block corresponds to individuals with a value of 1 for that attribute and the other block to those with a value of 0.

Assumption B.2 (Independence). Assume that the binary attributes $A_1, A_2, \dots, A_M$ and label $Y$ satisfy:
1. $A_1, \dots, A_M$ are independent.
2. $A_1, \dots, A_M$ are independent conditioned on $Y$.

In the following, when $A_j$ is used to denote an event inside a probability, it refers to the event $\{A_j = 1\}$; $\bar A_j$ refers to the event $\{A_j = 0\}$. We also use the notation $A_j^1 = A_j$ and $A_j^0 = \bar A_j$.

Proposition B.2.
For every $j = 1, \dots, M$, define $q_j = P(A_j \mid Y)$ and $a_j = P(A_j)$. Then, under Assumption B.2, for any index set $J = \{j_1, j_2, \dots, j_{|J|}\}$ and $(b_k)_{k=1}^{|J|} \in \{0,1\}^{|J|}$,
$$P(Y \mid A_{j_k}^{b_k},\ k = 1, \dots, |J|) = P(Y) \prod_{k=1}^{|J|} \left(\frac{q_{j_k}}{a_{j_k}}\right)^{b_k} \left(\frac{1 - q_{j_k}}{1 - a_{j_k}}\right)^{1 - b_k}.$$
Proof.
$$P(Y \mid A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}}) = \frac{P(Y, A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}})}{P(A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}})} = P(Y) \prod_{k=1}^{|J|} \frac{P(A_{j_k}^{b_k} \mid Y, A_{j_1}^{b_1}, \dots, A_{j_{k-1}}^{b_{k-1}})}{P(A_{j_k}^{b_k} \mid A_{j_1}^{b_1}, \dots, A_{j_{k-1}}^{b_{k-1}})} = P(Y) \prod_{k=1}^{|J|} \frac{P(A_{j_k}^{b_k} \mid Y)}{P(A_{j_k}^{b_k})} = P(Y) \prod_{k=1}^{|J|} \left(\frac{q_{j_k}}{a_{j_k}}\right)^{b_k} \left(\frac{1 - q_{j_k}}{1 - a_{j_k}}\right)^{1 - b_k}.$$
The third equality follows by independence, Assumption B.2.

The idea behind the above proposition is that, with the independence Assumption B.2, the structure of $P(Y \mid A_1^{b_1}, \dots, A_M^{b_M})$ is such that $P(Y)$ is scaled either by $q_j/a_j$ or by $(1 - q_j)/(1 - a_j)$, depending on whether we are in $A_j$ or $\bar A_j$. This in a sense makes the effects of protected attributes "pile on." If we assume WLOG that $q_j/a_j \ge 1$, then $(1 - q_j)/(1 - a_j) \le 1$.

Example B.3.
Suppose that $M = 3$, $P(Y) = 0.5$, and for every $j = 1, 2, 3$, $a_j = P(A_j) = 0.5$ and $q_j = P(A_j \mid Y) = 0.6$. (This is possible because for every $J$, $0 \le P(Y \mid A_j, j \in J) \le 1$, i.e., it is a well-defined probability.) Applying Proposition B.2, and noting $\frac{q_j}{a_j} = 1.2$ and $\frac{1 - q_j}{1 - a_j} = 0.8$:
$$P(Y \mid A_1) = P(Y \mid A_2) = P(Y \mid A_3) = 0.5 \cdot 1.2 = 0.6,$$
$$P(Y \mid \bar A_1) = P(Y \mid \bar A_2) = P(Y \mid \bar A_3) = 0.5 \cdot 0.8 = 0.4,$$
$$P(Y \mid A_j, A_k) = 0.5 \cdot (1.2)^2 = 0.72 \quad \forall\, 1 \le j < k \le 3,$$
$$P(Y \mid A_j, \bar A_k) = 0.5 \cdot 1.2 \cdot 0.8 = 0.48 \quad \forall\, 1 \le j \ne k \le 3,$$
$$P(Y \mid \bar A_j, \bar A_k) = 0.5 \cdot (0.8)^2 = 0.32 \quad \forall\, 1 \le j < k \le 3,$$
$$P(Y \mid A_1, A_2, A_3) = 0.5 \cdot (1.2)^3 = 0.864,$$
$$P(Y \mid A_i, A_j, \bar A_k) = 0.5 \cdot (1.2)^2 \cdot 0.8 = 0.576 \quad \forall\ \text{distinct } i, j, k,$$
$$P(Y \mid A_i, \bar A_j, \bar A_k) = 0.5 \cdot 1.2 \cdot (0.8)^2 = 0.384 \quad \forall\ \text{distinct } i, j, k,$$
$$P(Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.5 \cdot (0.8)^3 = 0.256.$$

Fact B.4.
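These conditional probabilities can be checked mechanically. The sketch below assumes the example's parameters are $P(Y) = 0.5$, $a_j = 0.5$, $q_j = 0.6$ (the values consistent with the products shown), computes $P(Y, \text{bits})$ from conditional independence given $Y$ and $P(\text{bits})$ from marginal independence, and recovers $P(Y \mid \text{bits})$ by Bayes' rule:

```python
pY, a, q = 0.5, 0.5, 0.6  # assumed parameters of Example B.3

def p_y_given(bits):
    """P(Y=1 | A_j = bits_j), using both independence assumptions."""
    p_y_bits = pY   # P(Y, bits): attributes independent given Y
    p_bits = 1.0    # P(bits): attributes marginally independent
    for b in bits:
        p_y_bits *= q if b else 1 - q
        p_bits *= a if b else 1 - a
    assert p_bits - p_y_bits >= 0  # the joint is a well-defined probability
    return p_y_bits / p_bits

print(round(p_y_given((1, 1, 1)), 3))  # 0.864 = 0.5 * 1.2^3
print(round(p_y_given((1, 1, 0)), 3))  # 0.576
print(round(p_y_given((1, 0, 0)), 3))  # 0.384
print(round(p_y_given((0, 0, 0)), 3))  # 0.256
```

Note the two independence conditions constrain different marginals of the joint, so the code only needs $P(Y,\text{bits})$ and $P(\text{bits})$; the internal assertion is exactly the "well-defined probability" caveat in the example.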
Assuming Assumption B.2 and the accuracy metric, the optimal intersectionally fair predictor $\hat Y$ assigns the probabilities
$$\forall\, b \in \{0,1\}^M, \quad P(\hat Y \mid A_1^{b_1}, \dots, A_M^{b_M}) = \operatorname{wmedian}_A\ P(Y) \prod_{j=1}^M \left(\frac{q_j}{a_j}\right)^{b_j} \left(\frac{1 - q_j}{1 - a_j}\right)^{1 - b_j},$$
where the weighted median $\operatorname{wmedian}_A$ of the $2^M$ numbers $\{r_{b^1} \le \dots \le r_{b^{2^M}} : b^i \in \{0,1\}^M\}$ is $r_{b^{i^*}}$, with
$$i^* = \min\Big\{i \in \mathbb{N} : \sum_{k \le i} P(A_1^{b^k_1}, \dots, A_M^{b^k_M}) \ge 0.5\Big\}.$$

(Proof sketch.) Taking the subgradient of $\mathbb{E}|Y - \hat Y|$: since we have the freedom to pick any single constant to assign to every $P(\hat Y \mid A_1^{b_1}, \dots, A_M^{b_M})$, we get the weighted median formula.

Fact B.5.
In Example B.3, using Fact B.4, (an) optimal intersectionally fair predictor assigns $P(\hat Y \mid A_1^{b_1}, A_2^{b_2}, A_3^{b_3}) = 0.384$ and has an error of
$$\frac{1}{8}\left(|0.864 - 0.384| + 3 \cdot |0.576 - 0.384| + |0.256 - 0.384|\right) = 0.148.$$
On the other hand, an optimal independently group fair predictor assigns
$$P(\hat Y \mid A_1, A_2, A_3) = 0.464, \quad P(\hat Y \mid A_i, A_j, \bar A_k) = 0.576, \quad P(\hat Y \mid A_i, \bar A_j, \bar A_k) = 0.384, \quad P(\hat Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.656.$$
This predictor has an error of $\frac{1}{8}\left(|0.864 - 0.464| + |0.256 - 0.656|\right) = 0.1$. This is strictly less than the optimal intersectional error $0.148$, i.e., there is a gap.

Proof.
By basically the same argument as for the intersectional case, the optimal value of $P(\hat Y \mid A_i) = P(\hat Y \mid \bar A_i)$ is a weighted median of the relevant conditional probabilities; we now verify that $\hat Y$ as defined above is independently group fair:
$$P(\hat Y \mid A_i) = \frac{1}{4}\left(P(\hat Y \mid A_i, A_j, A_k) + P(\hat Y \mid A_i, \bar A_j, A_k) + P(\hat Y \mid A_i, A_j, \bar A_k) + P(\hat Y \mid A_i, \bar A_j, \bar A_k)\right) = \frac{1}{4}\left(0.464 + 2(0.576) + 0.384\right) = 0.5,$$
$$P(\hat Y \mid \bar A_i) = \frac{1}{4}\left(P(\hat Y \mid \bar A_i, A_j, A_k) + P(\hat Y \mid \bar A_i, \bar A_j, A_k) + P(\hat Y \mid \bar A_i, A_j, \bar A_k) + P(\hat Y \mid \bar A_i, \bar A_j, \bar A_k)\right) = \frac{1}{4}\left(0.576 + 2(0.384) + 0.656\right) = 0.5.$$
Since $i \in \{1, 2, 3\}$ is arbitrary, independent group fairness is satisfied.

C Consistency and Generalization
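The gap in this example can be verified end to end. The sketch below assumes the reconstructed values of the two predictors (constant 0.384 for the intersectionally fair one; 0.464/0.576/0.384/0.656 by number of positive attributes for the independently fair one), checks demographic parity for every single attribute, and compares the errors:

```python
from itertools import product

# Bayes conditionals P(Y | cell), with P(Y)=0.5, q/a=1.2, (1-q)/(1-a)=0.8
bayes = {b: 0.5 * 1.2 ** sum(b) * 0.8 ** (3 - sum(b))
         for b in product((0, 1), repeat=3)}
inter = {b: 0.384 for b in bayes}  # intersectionally fair: one constant
indep = {b: {3: 0.464, 2: 0.576, 1: 0.384, 0: 0.656}[sum(b)] for b in bayes}

def error(pred):
    # each of the 8 cells has probability 1/8 in the example
    return sum(abs(bayes[b] - pred[b]) for b in bayes) / 8

def group_rate(pred, i, v):
    # P(Yhat | A_i = v); the other two attributes are uniform
    cells = [b for b in pred if b[i] == v]
    return sum(pred[b] for b in cells) / len(cells)

for i in range(3):  # independent DP: P(Yhat|A_i) = P(Yhat|not A_i) = 0.5
    assert abs(group_rate(indep, i, 1) - group_rate(indep, i, 0)) < 1e-12
print(round(error(inter), 3), round(error(indep), 3))  # 0.148 0.1
```

The independently fair predictor is exactly fair per attribute yet strictly more accurate, which is the claimed gap between the two fairness notions.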
Theorem 5.2.
With probability at least $1 - \delta$, if projected gradient ascent ($\mathrm{Update}_t(\lambda, v) = \mathrm{proj}_{[0,B]^J}(\lambda + \eta v)$) is run for $T$ iterations with step size $\eta = B/\sqrt{T}$, and for $t = 1, \dots, T$, $h_t = \mathrm{plugin}(\hat\eta, (\hat\pi_g)_{g \in G_{\mathrm{fair}}}, \psi, \Phi)$, then, letting $\rho = \max\{\|\psi\|_\infty, \|\phi_1\|_\infty, \dots, \|\phi_J\|_\infty\}$,
$$\mathcal{U}_\psi(\bar h_T) \le \mathcal{U}^*_\psi + \frac{JB}{\sqrt{T}} + ((1 + J)B + 1)\,\rho\left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1\, B\left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) + 2\sqrt{\frac{\log(|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g},$$
$$\|\mathcal{V}_\Phi(\bar h_T)\|_\infty \le \frac{J}{\sqrt{T}} + 4(4(1 + J) + 1)\,\rho\left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + 4\,\mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 \left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) + 8\sqrt{\frac{\log(|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}.$$
The first step is to extract the error incurred by plugging in $\hat\eta$ rather than $\eta$. Denoting $\hat h = \mathrm{plugin}(\hat\eta, (\hat\pi_g)_g, \psi, \Phi, \lambda)$ and $n_g = |\{i : x_i \in g\}|$, so that $\hat\pi_g = \frac{n_g}{n}$,
$$\hat h(x) = \operatorname*{argmin}_{k \in \{1, \dots, K\}} \left\{\hat\eta(x)^\top \left[D + \sum_{l=1}^J \lambda_l \Big(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x \in g} \frac{1}{\hat\pi_g} V^g_l\Big)\right]\right\}_k.$$
Denote $h = \mathrm{plugin}(\eta, (\pi_g)_g, \psi, \Phi, \lambda)$. We quantify the discrepancy. Define $\hat k = \hat h(x)$ and $k^* = h(x)$. Also, define
$$M = D + \sum_{l=1}^J \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x \in g} \frac{1}{\hat\pi_g} V^g_l\right).$$
$$\begin{aligned}
(\eta(x)^\top M)_{\hat k} - (\eta(x)^\top M)_{k^*} &= (\hat\eta(x)^\top M)_{\hat k} + [(\eta(x) - \hat\eta(x))^\top M]_{\hat k} - (\eta(x)^\top M)_{k^*} \\
&\le (\hat\eta(x)^\top M)_{k^*} + [(\eta(x) - \hat\eta(x))^\top M]_{\hat k} - (\eta(x)^\top M)_{k^*} + \xi \\
&= (\eta - \hat\eta)^\top M (e_{\hat k} - e_{k^*}) + \xi \\
&\le \|\eta - \hat\eta\|_1 \left(\sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g} + \rho_X\right) B + \xi,
\end{aligned}$$
where $\rho_g = \sum_{l=1}^J \|V^g_l\|_\infty$, $\rho_X = \|D\|_\infty + \sum_{l=1}^J \|U_l\|_\infty$, and $\xi = 2\sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g}$; here we use the fact that $|\pi_g - \hat\pi_g| \le \sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}}$ for every $g \in G_{\mathrm{fair}}$ with probability $1 - \delta/2$. Taking expectations, we arrive at
$$L(C(\hat h), \lambda) - L(C(h), \lambda) \le \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 \left(\sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g} + \rho_X\right) B + 2\sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g}. \quad (4)$$
By a standard subgradient ascent / online learning analysis, with step size $\eta = B/\sqrt{T}$,
$$\max_{\lambda \in [0,B]^J} \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda) - \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda_t) \le \frac{JB}{\sqrt{T}},$$
because $L(h, \cdot)$ is concave and $\sqrt{J}$-Lipschitz (all fairness violations are assumed to lie in $[0,1]$), and the $\ell_2$ radius of $[0,B]^J$ is $\sqrt{J} B$.

Now we show how good a saddle point $\big(\frac{1}{T}\sum_{t=1}^T h_t, \frac{1}{T}\sum_{t=1}^T \lambda_t\big) =: (\bar h_T, \bar\lambda_T)$ is for the population problem. By convexity of $L$ in the first argument,
$$\max_{\lambda \in [0,B]^J} \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda) \ge \max_{\lambda \in [0,B]^J} \hat L(\bar h_T, \lambda).$$
Using equation (4) and the fact that $h_t$ is the minimizer of $L(C[h], \lambda_t)$, but computed with $\hat\eta$ instead of $\eta$,
$$\frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda_t) \le \min_{h : X \to [0,1]^K} L(h, \bar\lambda_T) + 4(1 + J) B \rho \left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + B\left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 + \xi,$$
where the middle term comes from Lemma D.1. Absorbing the error terms into $\gamma$, we can write
$$\max_{\lambda \in [0,B]^J} \hat L(\bar h_T, \lambda) - \min_{h : X \to [0,1]^K} L(h, \bar\lambda_T) \le \frac{JB}{\sqrt{T}} + \gamma.$$
Letting $(h^*, \lambda^*)$ be primal-dual optimal, we have
$$\forall \lambda \in [0,B]^J, \quad L(h^*, \lambda^*) \ge \hat L(\bar h_T, \lambda) - \frac{JB}{\sqrt{T}} - \gamma. \quad (5)$$
The choices $\lambda = 0$ and $\lambda = \lambda^* + B e_k$ give
$$\hat{\mathcal{U}}(\bar h_T) \le \mathcal{U}(h^*) + \gamma + \frac{JB}{\sqrt{T}}, \qquad \hat{\mathcal{V}}(\bar h_T)_k \le \frac{1}{B}\left(\frac{JB}{\sqrt{T}} + 2\gamma\right).$$
By Lemma D.1, for all $g \in G_{\mathrm{fair}}$,
$$\sup_{h \in \mathcal{H}_{\mathrm{plg}}} \|C^g[h] - \hat C^g[h]\|_\infty \le \sqrt{\frac{K^2 \log(2 n_g)}{n_g}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_g}} =: \zeta(n_g),$$
so with probability at least $1 - \delta$,
$$\mathcal{U}(\bar h_T) \le \mathcal{U}(h^*) + \gamma + \frac{JB}{\sqrt{T}} + \rho\,\zeta(n_{\min}), \qquad \mathcal{V}(\bar h_T)_k \le \frac{1}{B}\left(\frac{JB}{\sqrt{T}} + 2\gamma\right) + \rho\,\zeta(n_{\min}).$$
Substituting the expression for $\gamma$ yields the bounds in the theorem statement.

D Estimators
In this section, we give plugin and weighted ERM methods for solving the linear probabilistic minimization problems arising from the Lagrangian of our fairness problem. For clarity, we go over the choices of cost and constraint matrices corresponding to what we use in our experiments.

In our experiments, we maximize accuracy while enforcing independent demographic parity constraints and group-weighted gerrymandering demographic parity constraints. Under the framework of our probabilistic optimization problem, the former corresponds to the choice $G_{\mathrm{fair}} = G_{\mathrm{independent}}$, and $\Phi$ containing the $4M$ constraints (a $\pm$ pair for each group)
$$\forall g \in G_{\mathrm{independent}}, \quad \pm\left(C^g_{+,1} - C_{+,1}\right) \le \nu,$$
where the $+$ subscript denotes summing over the indices $0, 1$ in its place. That is, for $g \in G_{\mathrm{independent}}$,
$$V^{gg,\pm} = \pm\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad V^{g'g,\pm} = 0 \ \text{ for } g' \ne g, \quad U^{g,\pm} = \pm\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
The latter corresponds to the choice $G_{\mathrm{fair}} = G_{\mathrm{gerrymandering}}$, and the $|G_{\mathrm{gerrymandering}}|$ pairs of constraints
$$\forall g \in G_{\mathrm{gerrymandering}}, \quad \pm P(g)\left(C^g_{+,1} - C_{+,1}\right) \le \nu.$$
This corresponds to, for $g \in G_{\mathrm{gerrymandering}}$,
$$V^{gg,\pm} = \pm P(g)\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad V^{g'g,\pm} = 0 \ \text{ for } g' \ne g, \quad U^{g,\pm} = \pm P(g)\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$
The $P(g)$'s cancel out with the $\frac{1}{P(g)}$'s in the expressions below.
D.1 Plugin Estimator
Using linearity of $\psi$ and $\phi$, if $\eta$ is known, the population minimizer $h^* = \operatorname{argmin}_{h : X \to [K]} L(h, \lambda)$ is deterministic and has a convenient closed-form solution (the same is true of any linear minimization):
$$L(h, \lambda) = \left\langle D + \sum_{l=1}^L \lambda_l U_l,\ C[h] \right\rangle - \sum_{g \in G_{\mathrm{fair}}} \sum_{l=1}^L \lambda_l \left\langle V^g_l,\ C^g[h] \right\rangle = \mathbb{E}\left\{\left\langle D + \sum_{l=1}^L \lambda_l U_l,\ \eta(x)\, e_{h(x)}^\top \right\rangle - \sum_{g \in G_{\mathrm{fair}}} \sum_{l=1}^L \lambda_l \left\langle V^g_l,\ \frac{\mathbb{1}\{x \in g\}}{P(g)}\, \eta(x)\, e_{h(x)}^\top \right\rangle\right\} = \mathbb{E}\, \eta(x)^\top \left[D + \sum_{l=1}^L \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}}} \frac{\mathbb{1}\{x \in g\}}{P(g)} V^g_l\right)\right] e_{h(x)},$$
where we used that the conditional group confusion equals $C^g[h] = \mathbb{E}\big[\mathbb{1}\{x \in g\}\, \eta(x)\, e_{h(x)}^\top\big] / P(g)$. Denote $\pi_g = P(g)$ for $g \in G_{\mathrm{fair}}$ as the group probabilities. Thus, the minimizer has the deterministic form
$$h^*(x) = \operatorname*{argmin}_{k \in \{1, \dots, K\}} \left\{\eta(x)^\top \left[D + \sum_{l=1}^L \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}}} \frac{\mathbb{1}\{x \in g\}}{\pi_g} V^g_l\right)\right]\right\}_k. \quad (6)$$
Finally, since we do not actually have access to the true $\eta$, we replace $\eta$ with an estimate $\hat\eta$.
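The plugin rule is a per-example argmin over classes; a minimal sketch follows. The matrix names mirror the text, the toy inputs are hypothetical, and `eta_x` stands for the estimated class-probability vector $\hat\eta(x)$:

```python
import numpy as np

def plugin_classify(eta_x, in_group, pi, D, U, V, lam):
    """Plugin decision: argmin_k { eta(x)^T [D + sum_l lam_l (U_l - sum_{g: x in g} V^g_l / pi_g)] }_k.

    eta_x: (K,) estimated class probabilities for this x
    in_group: dict g -> bool (does x belong to group g); pi: dict g -> P(g)
    D: (K, K) cost matrix; U: list of (K, K); V: dict g -> list of (K, K)
    lam: sequence of Lagrange multipliers, one per constraint
    """
    M = D.copy()
    for l, lam_l in enumerate(lam):
        M = M + lam_l * U[l]
        for g, member in in_group.items():
            if member:
                M = M - lam_l * V[g][l] / pi[g]
    scores = eta_x @ M          # scores[k] = eta(x)^T M e_k
    return int(np.argmin(scores))

D = np.array([[0., 1.], [1., 0.]])  # 0-1 misclassification cost
k = plugin_classify(np.array([0.3, 0.7]), {}, {}, D, [], {}, [])
print(k)  # 1: with no active constraints, picks the class with larger eta
```

With all multipliers at zero the rule reduces to the Bayes classifier for the cost matrix $D$; the group terms shift the per-class costs only for the groups containing $x$, exactly as in (6).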
In the weighted ERM approach (referred to as cost-sensitive classification in the binary case (Agarwal et al., 2018)), we parametrize $h : X \to [K]$ by a function class $\mathcal{F}$ of functions $f : X \to \mathbb{R}^K$. The classification is the argmax of the predicted vector, $h(x) = \operatorname{argmax}_j f(x)_j$, so we denote the set of classifiers as $\mathcal{H}_{\mathrm{werm}} = \operatorname{argmax} \circ \mathcal{F}$. For a standard classification problem with 0-1 error, minimizing the dataset error $\widehat{\mathrm{err}}[h] = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{h(x_i) \ne y_i\}$ is done by minimizing a surrogate loss $\ell : \mathbb{R}^K \times [K] \to \mathbb{R}_+$ (e.g., softmax cross-entropy) over the dataset, $\hat{\mathbb{E}}\, \ell(f(x), y) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$. Then we take $h = \operatorname{argmax} \circ f$. Let $\ell(s) \in \mathbb{R}^K$ denote the vector with $\ell(s)_k = \ell(s, k)$.

In an analogous manner, we would like to minimize the empirical metric defined by the Lagrangian using a surrogate loss, as
$$\min_{h \in \mathcal{H}_{\mathrm{werm}}} \hat L(h, \lambda) = \sum_{i=1}^n e_{y_i}^\top \left[\frac{1}{n} D + \sum_{l=1}^L \frac{\lambda_l}{n} \left(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x_i \in g} \frac{n}{n_g} V^g_l\right)\right] e_{h(x_i)},$$
where $n_g = |\{i : x_i \in g\}|$, $g \in G_{\mathrm{fair}}$, are the empirical sizes of each group. Notice it has the form
$$\min_{h \in \mathcal{H}_{\mathrm{werm}}} \sum_{i=1}^n w_i^\top e_{h(x_i)} = \sum_{i=1}^n s(w_i)\, \frac{w_i^\top}{s(w_i)}\, e_{h(x_i)}, \qquad s(w_i) = \frac{1}{K - 1} \sum_{k=1}^K (w_i)_k.$$
If we interpret $\mathbb{1} - \frac{w_i}{s(w_i)}$ as a probability distribution over labels and $s(w_i)$ as its weight, then we have $\min_h \tilde{\mathbb{E}}\big[(\mathbb{1} - \tilde\eta(x))^\top e_{h(x)}\big]$, where $\tilde P(x_i) = \frac{s(w_i)}{\sum_{i=1}^n s(w_i)}$ and $\tilde\eta(x_i) = \mathbb{1} - \frac{w_i}{s(w_i)}$.

A priori, $\max_k \frac{(w_i)_k}{s(w_i)} \le 1$, i.e., $\frac{\max_k (w_i)_k}{\sum_{k=1}^K (w_i)_k} \le \frac{1}{K - 1}$, may not hold. But, since shifting each entry of $w_i$ by the same amount does not change the initial optimization problem, we can add the constant amount $(K - 1)\max_k (w_i)_k - \sum_{k=1}^K (w_i)_k$ to each entry of $w_i$, after which $\frac{w_i}{s(w_i)} \le \mathbb{1}$.

If $\ell$ is a surrogate loss used to minimize the multiclass error, it is assumed that we can minimize $\mathbb{E}\big[(\mathbb{1} - \eta(x))^\top e_{h(x)}\big]$ by minimizing $\mathbb{E}\big[\eta(x)^\top \ell(f(x))\big]$ and taking $h = \operatorname{argmax} \circ f$. Therefore, we can solve the weighted version by minimizing the reweighted surrogate loss:
$$\min_{f \in \mathcal{F}} \tilde{\mathbb{E}}\big[\tilde\eta(x)^\top \ell(f(x))\big] \equiv \min_{f \in \mathcal{F}} \sum_{i=1}^n s(w_i) \left(\mathbb{1} - \frac{w_i}{s(w_i)}\right)^\top \ell(f(x_i)) =: \hat L(f). \quad (7)$$
This provides a convex surrogate for the original problem of minimizing the empirical Lagrangian.
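The shift-and-normalize step can be sketched as follows. This is a minimal version assuming the shift constant is $(K-1)\max_k (w_i)_k - \sum_k (w_i)_k$ and the normalizer is $s(w) = \frac{1}{K-1}\sum_k w_k$, so that $\mathbb{1} - w/s(w)$ sums to one (the specific cost vector below is hypothetical):

```python
import numpy as np

K = 3  # number of classes

def reweight(w):
    """Shift w (nonnegative costs) so max_k w_k <= s(w) := sum_k w_k / (K-1),
    then return (weight s(w), pseudo-label distribution 1 - w / s(w))."""
    shift = max(0.0, (K - 1) * w.max() - w.sum())  # constant shift keeps argmin unchanged
    w = w + shift
    s = w.sum() / (K - 1)
    eta_tilde = 1.0 - w / s
    return s, eta_tilde

s, eta = reweight(np.array([0.2, 0.5, 0.1]))
print(round(s, 3))       # 0.7
print(round(eta.sum(), 12))  # 1.0: a valid distribution over labels
print(bool(eta.min() >= 0))  # True: the shift guarantees nonnegativity
```

Because $\sum_k (1 - w_k/s(w)) = K - (K-1) = 1$ identically, the reweighted surrogate in (7) can be fed to any standard multiclass loss that accepts soft label distributions.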
Lemma D.1 (Confusion matrix generalization). Denote by $n_g$ the number of samples belonging to group $g$, for $g \in G_{\mathrm{fair}} \cup \{X\}$. Then with probability at least $1 - \delta$, for all $g \in G_{\mathrm{fair}} \cup \{X\}$,
$$\sup_{h \in \operatorname{conv} \mathcal{H}} \|C^g[h] - \hat C^g[h]\|_\infty \le \sqrt{\frac{\mathrm{VC}(\mathcal{H}) \log(n_g + 1)}{n_g}} + \sqrt{\frac{\log((1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_g}}.$$
By standard binary classification generalization bounds (Boucheron et al., 2005), with probability at least $1 - \delta'$,
$$\sup_{h \in \operatorname{conv} \mathcal{H}} \left| P(Y = i, h(X) = j \mid g) - \hat P(Y = i, h(X) = j \mid g) \right| \le \sqrt{\frac{\mathrm{VC}(\mathcal{H}) \log(n_g + 1)}{n_g}} + \sqrt{\frac{\log(1/\delta')}{n_g}}.$$
Then we take a union bound over the $1 + |G_{\mathrm{fair}}|$ confusion matrices and the $K^2$ entries per confusion matrix.
Suppose $\psi : [0,1]^{K \times K} \to [0,1]$ and $\Phi : [0,1]^{K \times K} \times ([0,1]^{K \times K})^{G_{\mathrm{fair}}} \to [0,1]^L$ are $\rho$-Lipschitz w.r.t. $\|\cdot\|_\infty$. Recall $\hat L(h, \lambda) = \hat{\mathcal{E}}(h) + \lambda^\top (\hat{\mathcal{V}}(h) - \varepsilon)$. Let $\gamma$ denote the bound in Lemma D.1 that applies to $C$, $\gamma_g$ the bound that applies to $C^g$, and denote $\gamma_{G_{\mathrm{fair}}} = \max_{g \in G_{\mathrm{fair}}} \gamma_g$. Suppose $\varepsilon \ge \rho\gamma_{G_{\mathrm{fair}}}$. Then with probability $1 - \delta$: if $(\bar h, \bar\lambda)$ is a $\nu$-saddle point of $\max_{\lambda \in [0,B]^L} \min_{h \in \operatorname{conv} \mathcal{H}} \hat L(h, \lambda)$, in the sense that $\max_{\lambda \in [0,B]^L} \hat L(\bar h, \lambda) - \min_{h \in \operatorname{conv} \mathcal{H}} \hat L(h, \bar\lambda) \le \nu$, and $h^* \in \operatorname{conv} \mathcal{H}$ satisfies $\mathcal{V}(h^*) \le 0$, then
$$\mathcal{E}(\bar h) \le \mathcal{E}(h^*) + \nu + 2\rho\gamma, \quad (8)$$
$$\|\mathcal{V}(\bar h)\|_\infty \le \frac{\nu + 1}{B} + \rho\gamma_{G_{\mathrm{fair}}} + \varepsilon. \quad (9)$$
Thus, as long as we can find an arbitrarily good saddle point (which follows from weighted ERM if $\mathcal{H}_{\mathrm{werm}}$ is expressive enough while having finite VC dimension), we obtain consistency.
By Lemma D.1, with probability $1 - \delta$,
$$|\mathcal{E}(h) - \hat{\mathcal{E}}(h)| \le \rho\gamma, \qquad \|\mathcal{V}(h) - \hat{\mathcal{V}}(h)\|_\infty \le \rho\gamma_{G_{\mathrm{fair}}}. \quad (10)$$
Therefore, $\hat{\mathcal{V}}(h^*) \le \varepsilon$. Using this feasibility to argue the first inequality below,
$$\hat{\mathcal{E}}(\bar h) - \hat{\mathcal{E}}(h^*) \le \hat{\mathcal{E}}(\bar h) - \hat L(h^*, \bar\lambda) = \hat L(\bar h, 0) - \hat L(h^*, \bar\lambda) \le \nu.$$
Then (8) follows from (10) and the triangle inequality. For the next part,
$$B\big(\hat{\mathcal{V}}(\bar h)_k - \varepsilon\big) = \hat L(\bar h, B e_k) - \hat{\mathcal{E}}(\bar h) \le \nu + \hat L(h^*, \bar\lambda) - \hat{\mathcal{E}}(\bar h) \le \nu + \hat{\mathcal{E}}(h^*) - \hat{\mathcal{E}}(\bar h) \le \nu + 1.$$
This and (10) imply (9).
E Datasets
Here we discuss the datasets used and additional experimental details.
Communities and Crime: contains neighborhoods featurized by various statistics pertaining to the neighborhoods, e.g., percent employed in various professions, demographics, rent, etc. The label is whether there is a high (above a fixed percentile) rate of violent crimes per capita. There are n = 1994 samples and N = 12 protected attributes comprising various racial statistics.

Adult census: contains census data for n = 2020 individuals. The label is whether an individual has high income. There are N = 7 protected attributes comprising age, sex, and different races.

German credit: (Dua and Graff, 2017) contains features such as financial holdings, occupation, housing, and reason for purchases, and the goal is to predict whether an individual has good credit. Several categorical variables were converted to one-hot encodings. There are n = 1000 examples and N = 3 protected attributes corresponding to age, sex, and foreign worker status.

Law school: contains n = 1823 students and their GPAs, cluster, and LSAT score. The goal is to predict whether the student passes the bar, and the protected attributes are age, gender, and family income.
For the constraint level $\nu$ we vary according to a logarithmically spaced grid, ending at 1, with 20 points. We set $B = 50$ for the GroupFair methods. We vary the regularization parameter $\rho$ across a logarithmically spaced grid with 20 points, with endpoints proportional to $1/M$.

The authors of Kearns et al. (2018) apply fictitious play to the gerrymandering problem, searching for the most violated constraint $\max_{g \in G_{\mathrm{fair}}} \frac{n_g}{n}\big|C^g_{0,1} + C^g_{1,1} - C_{0,1} - C_{1,1}\big|$ in response to the average of the predictors computed so far (if the violation exceeds $\nu$), and computing the minimizing predictor in response to the average of the dual variables obtained from the most violated constraints so far. On the other hand, we directly apply our GroupFair framework to their original cost function (see Kearns et al. (2018)), i.e., the problem of maximizing accuracy subject to $\forall g \in G_{\mathrm{fair}},\ \frac{|g|}{n}\big|C^g_{0,1} + C^g_{1,1} - C_{0,1} - C_{1,1}\big| \le \nu$. Both approaches aim to solve this problem.

Here are the full (training in addition to test) plots for the independent and gerrymandering experiments.

Figure 3: Experiments on independent group fairness. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.