Fairness with Overlapping Groups
A Preprint
Forest Yang* (UC Berkeley)
Moustapha Cisse (Google Research Accra)
Sanmi Koyejo (Google Research Accra & University of Illinois)

Abstract
In algorithmically fair prediction problems, a standard goal is to ensure the equality of fairness metrics across multiple overlapping groups simultaneously. We reconsider this standard fair classification problem using a probabilistic population analysis, which, in turn, reveals the Bayes-optimal classifier. Our approach unifies a variety of existing group-fair classification methods and enables extensions to a wide range of non-decomposable multiclass performance metrics and fairness measures. The Bayes-optimal classifier further inspires consistent procedures for algorithmically fair classification with overlapping groups. On a variety of real datasets, the proposed approach outperforms baselines in terms of its fairness-performance tradeoff.
Introduction

Machine learning models inform an increasingly large number of critical decisions in diverse settings. They assist medical diagnosis (McKinney et al., 2020), guide policing (Meijer and Wessels, 2019), and power credit scoring systems (Tsai and Wu, 2008). While they have demonstrated their value in many sectors, they are prone to unwanted biases, leading to discrimination against protected subgroups within the population. For example, recent studies have revealed biases in predictive policing and criminal sentencing systems (Meijer and Wessels, 2019; Chouldechova, 2017). The blossoming body of research in algorithmic fairness aims to study and address this issue by introducing novel algorithms guaranteeing a certain level of non-discrimination in predictions. Each such algorithm relies on a specific definition of fairness, which falls into one of two categories: individual fairness (Dwork et al., 2012; Zemel et al., 2013) or group fairness (Calders and Verwer, 2010; Kamishima et al., 2011; Hardt et al., 2016a). The vast majority of the algorithmic group fairness literature has focused on the simplest case where there are only two groups. In this paper, we consider the more nuanced case of group fairness with respect to multiple groups.

The simplest setting is the independent case, with only one sensitive attribute which can take multiple values, e.g., race only. The presence of multiple sensitive attributes (e.g., race and gender simultaneously) leads to non-equivalent definitions of group fairness. On the one hand, fairness can be considered independently per sensitive attribute, leading to overlapping subgroups. For example, consider a model restricted to demographic parity between subgroups defined by ethnicity. Simultaneously, the model can be constrained to fulfill demographic parity between subgroups defined by gender. We term fairness in this situation independent group fairness.
On the other hand, one can consider all subgroups defined by intersections of sensitive attributes (e.g., ethnicity and gender), leading to intersectional group fairness. A given algorithm can be independently group fair, e.g., when considering race and gender in isolation, but not intersectionally group fair, e.g., when considering intersections of racial and gender groups. For example, Buolamwini and Gebru (2018) showed how facial recognition software had particularly poor performance for black women. This phenomenon, called fairness gerrymandering, has been studied by Kearns et al. (2018). Intersectional fairness is often considered ideal. However, it comes with major statistical and computational hurdles, such as data scarcity at intersections of minority groups and the potentially exponential number of subgroups. Indeed, current algorithms consist of either brute-force enumeration or searching via a cost-sensitive classification problem, and intersectional groups are often empty with finite samples (Kearns et al., 2018). On the other hand, independent group fairness still provides a broad measure of fairness and is much easier to enforce.

We seek to design unifying, statistically consistent strategies for group fairness and to clarify the relationship between the existing definitions.
Our main results and algorithms apply to arbitrary overlapping group definitions. Our contributions are summarized in the following.

• Probabilistic results. We characterize the population-optimal (also known as the Bayes-optimal) prediction procedure for multiclass classification, where all the metrics are general linear functions of the confusion matrix. We consider both overlapping (independent, gerrymandering) and non-overlapping (unrestricted, intersectional) group fairness.

• Algorithms and statistical results. Inspired by the population optimal, we propose simple plugin and weighted empirical risk minimization (ERM) approaches for algorithmically fair classification, and prove their consistency, i.e., the empirical estimator converges to the population optimal with sufficiently large samples. Our general approach recovers existing results for plugin and weighted ERM group-fair classifiers.

• Comparisons. We compare independent group fairness to the overlapping case. We show that intersectional fairness implies overlapping group fairness under weak conditions. However, the converse is not true, i.e., overlapping fairness may not imply intersectional fairness. This result formalizes existing observations on the dangers of gerrymandering.

• Evaluation. Empirical results are provided to highlight our theoretical claims.

Taken together, our results unify and advance the state of the art with respect to the probabilistic, statistical, and algorithmic understanding of group-fair classification. The generality of our approach gives significant flexibility to the algorithm designer when constructing algorithmically-fair learners.
Notation

Throughout the paper, we use uppercase bold letters to represent matrices and lowercase bold letters to represent vectors. Let e_i represent the i-th standard basis vector, whose i-th entry is 1 and all other entries are 0, i.e., e_i = (0, ..., 1, ..., 0). We denote by 1 the all-ones vector, with dimension inferred from context. Given two matrices A, B of the same dimension, ⟨A, B⟩ = Σ_{i,j} a_{ij} b_{ij} is the Frobenius inner product. For any quantity q, q̂ denotes an empirical estimate. Due to limited space, proofs are presented in the appendix. Group notation.
We assume M sensitive attributes, where the m-th attribute is indicated by a group A_m, m ∈ [M]. For example, A_1 may correspond to race, A_2 may correspond to gender, and so on. Combined, the sensitive group indicator is represented by an M-dimensional vector a ∈ A = A_1 × A_2 × ⋯ × A_M. In other words, each instance is associated with M subgroups simultaneously. Probabilistic notation.
Consider the multiclass classification problem where Z denotes the instance space and Y = [K] denotes the output space with K classes. We assume the instances, outputs, and groups are sampled from a probability distribution P over the domain Y × Z × A. A dataset is given by n samples (y^(i), z^(i), a^(i)) ~ P i.i.d., i ∈ [n]. To simplify notation, let X = Z × A, so x = (z, a). Define the set of randomized classifiers H_r = {h : X → Δ_K}, where Δ_q = {p ∈ [0, 1]^q : Σ_{i=1}^q p_i = 1} is the q-dimensional probability simplex. A classifier h is associated with the random variable h ∈ [K] defined by P(h = k | x) = h_k(x). If h is deterministic, then we can write h(x) = e_{h(x)}. Confusion matrices.
For any multiclass classifier, let η(x) ∈ Δ_K denote the class probabilities for a given instance x and sensitive attribute a, whose k-th element is the conditional probability of the output belonging to class k, i.e., η_k(x) = P(Y = k | X = x). The population confusion matrix is C ∈ [0, 1]^{K×K}, with elements defined for k, ℓ ∈ [K] as C_{k,ℓ} = P(Y = k, h = ℓ), or equivalently, C_{k,ℓ} = ∫_x η_k(x) h_ℓ(x) dP(x). Group-specific confusion matrices.
Let G represent a set of subsets of the instances, i.e., potentially overlapping partitions of the instances X. We leave G generic for now, and will specify cases specific to fairness in the following. Given any group g ∈ G, we can define the group-specific confusion matrix C^g ∈ [0, 1]^{K×K}, with elements defined for k, ℓ ∈ [K] as C^g_{k,ℓ} = ∫_x η_k(x) h_ℓ(x) dP(x | x ∈ g).
We will abbreviate the event {x ∈ g} to simply g when it is clear from context. Let π_g = P(X ∈ g) be the probability of group g. It is clear that when the groups G form a partition, i.e., a ∩ b = ∅ for all a, b ∈ G and ∪_{g∈G} g = X, the population confusion may be recovered as a weighted average of group confusions, C = Σ_{g∈G} π_g C^g. Let ω_k = P(Y = k) = Σ_ℓ C_{k,ℓ} be the probability of label k, and ω^g_k = P(Y = k | X ∈ g) = Σ_ℓ C^g_{k,ℓ} be the probability of label k given group g. The sample confusion matrix is defined as Ĉ[h] = (1/n) Σ_{i=1}^n Ĉ^(i)[h], where Ĉ^(i)[h] ∈ [0, 1]^{K×K} and Ĉ^(i)_{k,ℓ}[h] = ⟦y_i = k⟧ h_ℓ(x_i). Here, ⟦·⟧ is the indicator function, so Σ_{k=1}^K Σ_{ℓ=1}^K Ĉ^(i)_{k,ℓ}[h] = 1. The empirical group-specific confusion matrices Ĉ^g are computed by conditioning on groups. In the empirical case, it is convenient to represent group memberships via indices alone, i.e., x_i ∈ g as i ∈ g. We have Ĉ^g[h] = (1/|g|) Σ_{i∈g} Ĉ^(i)[h].
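The sample confusion matrices above translate directly to code. A minimal NumPy sketch (the `group_masks` dictionary of boolean membership vectors is an assumed input format; the groups it encodes may overlap):

```python
import numpy as np

def confusion(y, probs):
    """C_hat[k, l] = (1/n) * sum_i [y_i = k] * h_l(x_i), where probs[i] is the
    randomized classifier's distribution over the K classes for example i."""
    n, K = probs.shape
    C = np.zeros((K, K))
    for k in range(K):
        C[k] = probs[y == k].sum(axis=0) / n
    return C

def group_confusions(y, probs, group_masks):
    """C_hat^g = (1/|g|) * sum_{i in g} C_hat^(i): condition on group membership."""
    return {g: confusion(y[m], probs[m]) for g, m in group_masks.items()}
```

When the groups form a partition, the population confusion is recovered as C = Σ_g π_g Ĉ^g, which makes a convenient sanity check.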
Fairness constraints. Let G_fair represent the (potentially overlapping) set of groups across which we wish to enforce fairness. The following states our formal assumption on G_fair.

Assumption 2.1. G_fair is a function of the sensitive attributes A only.

We will focus the discussion on common cases in the literature. These include non-overlapping (unrestricted, intersectional) and overlapping (independent, gerrymandering) group partitions.

• Unrestricted case.
The simplest case is where the group is defined by a single sensitive attribute (when there are multiple sensitive attributes, all but one are ignored). This has been the primary setting addressed by past literature (Hardt et al., 2016a; Narasimhan, 2018; Agarwal et al., 2018). Thus, for some fixed i ∈ [M], g_j = {(z, a) | a_i = j}, so |G_unrestricted| = |A_i|. In the special case of binary sensitive attributes, |G_unrestricted| = 2.

• Intersectional groups. Here, the non-overlapping groups are associated with all possible combinations of sensitive features. Thus g_a = {(z, a′) | a′ = a} for all a ∈ A, so |G_intersectional| = Π_{m∈[M]} |A_m|. In the special case of binary sensitive attributes, |G_intersectional| = 2^M.

• Independent groups. Here, the groups are overlapping, with a set of groups associated with each fairness attribute separately. It is convenient to denote the groups based on indices representing each attribute and each potential setting. Thus g_{i,j} = {(z, a) | a_i = j}, so |G_independent| = Σ_{m∈[M]} |A_m|. In the special case of binary sensitive attributes, |G_independent| = 2M.

• Gerrymandering intersectional groups. Here, group intersections are defined by any subset of the sensitive attributes, leading to overlapping subgroups: G_gerrymandering = {{(z, a) : a_I = s} : I ⊆ [M], s ∈ A_I}, where a_I denotes a restricted to the entries indexed by I. It is also the closure of G_independent under intersection. As a result, G_intersectional ⊆ G_gerrymandering and G_independent ⊆ G_gerrymandering. In the special case of binary sensitive attributes, |G_gerrymandering| = 3^M.
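These group structures (and the counts above) can be enumerated directly for binary sensitive attributes; a small sketch with hypothetical helper names:

```python
from itertools import combinations, product

def independent_groups(M):
    # one group per (attribute, value) pair: |G_independent| = 2M for binary attributes
    return [(i, v) for i in range(M) for v in (0, 1)]

def intersectional_groups(M):
    # one group per full assignment of all attributes: |G_intersectional| = 2^M
    return list(product((0, 1), repeat=M))

def gerrymandering_groups(M):
    # one group per subset I of attributes plus an assignment s over I: 3^M groups,
    # since each attribute is either fixed to 0, fixed to 1, or left unconstrained
    groups = []
    for r in range(M + 1):
        for I in combinations(range(M), r):
            for s in product((0, 1), repeat=r):
                groups.append(dict(zip(I, s)))
    return groups
```

For M = 3 these yield 6, 8, and 27 groups respectively, matching 2M, 2^M, and 3^M.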
Fairness metrics. We formulate group fairness by upper bounding a fairness violation function V : H → R^J, which can be represented as a linear function of the confusion matrices, i.e., V(h) = Φ(C[h], {C^g[h]}_{g∈G_fair}), where for all j ∈ [J], V(h)_j = φ_j(C[h], {C^g[h]}_{g∈G_fair}) = ⟨U_j, C⟩ − Σ_{g∈G_fair} ⟨V^g_j, C^g⟩. This formulation is sufficiently flexible to include the fairness statistics in common use that we are aware of as special cases. For example, demographic parity for binary classifiers (Dwork et al., 2012) can be defined by fixing C^g_{0,1} + C^g_{1,1} across groups. Equal opportunity (Hardt et al., 2016b) is recovered by fixing the group-specific true positives, using population-specific weights, i.e.,

φ^±_DP = ±(C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν,   φ^±_EO = ±(C^g_{1,1}/ω^g_1 − C_{1,1}/ω_1) − ν,

using both a positive and a negative constraint to penalize both positive and negative deviations between the group and the population, with relaxation ν. Performance metrics.
We consider an error metric E : H → R_+ that is a linear function of the population confusion, E(h) = ψ(C) = ⟨D, C[h]⟩. This setting has been studied in binary classification (Yan et al., 2018), multiclass classification (Narasimhan et al., 2015), multilabel classification (Koyejo et al., 2015), and multioutput classification (Wang et al., 2019). For instance, standard classification error corresponds to setting D = 11^⊤ − I. The goal is to learn the Bayes-optimal classifier with respect to the given metric, which, when it exists, is given by:

h* ∈ argmin_h E(h) s.t. V(h) ≤ 0.   (1)

We denote the optimal error as E* = E(h*). We say a classifier h_n constructed using finite data of size n is {E, V}-consistent if E(h_n) →_P E* and V(h_n) →_P 0 as n → ∞. We also consider the empirical versions of the error, Ê(h) = ψ(Ĉ[h]), and of the fairness violation, V̂(h) = Φ(Ĉ[h], {Ĉ^g[h]}_g).
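Both the error metric and the fairness violations are plain Frobenius inner products with (group) confusion matrices. A binary-classification sketch, assuming labels {0, 1} with column 1 holding the positive predictions:

```python
import numpy as np

def linear_error(C, D):
    # E(h) = <D, C>; with D = 1 1^T - I this is the standard 0-1 error
    return float((D * C).sum())

def dp_violation(C, C_g, nu=0.0):
    """max of phi_DP^+ and phi_DP^-: the absolute gap between the group's and the
    population's predicted-positive rate, minus the relaxation nu."""
    gap = (C_g[0, 1] + C_g[1, 1]) - (C[0, 1] + C[1, 1])
    return abs(gap) - nu

K = 2
D_01 = np.ones((K, K)) - np.eye(K)  # off-diagonal mass = misclassification rate
```

The same pattern extends to any of the linear metrics in Table 1 by swapping the matrices D, U_j, and V_j^g.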
Table 1: Examples of multiclass performance metrics and fairness metrics studied in this manuscript.

  Metric         ψ(C)                                        Fairness Metric         φ(C, {C^g}_g)
  Weighted Acc.  Σ_{i=1}^K Σ_{j=1}^K b_{i,j} C_{i,j}         Demographic Parity      (C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν
  Ordinal Acc.   Σ_{i=1}^K Σ_{j=1}^K (1 − |i−j|/(K−1)) C_{i,j}   Equalized Opportunity   (C^g_{1,1}/ω^g_1 − C_{1,1}/ω_1) − ν

Bayes-Optimal Group-Fair Classifiers

In this section, we identify a parametric form for the Bayes-optimal group-fair classifier under standard assumptions. To begin, we introduce the following general assumption on the joint distribution.
Assumption 3.1 (η-continuity). Assume P({η(x) = c}) = 0 for all c ∈ Δ_K. Furthermore, let Q = η(x) be a random variable with density p_η(Q), where p_η(Q) is absolutely continuous with respect to the Lebesgue measure restricted to Δ_K.

This assumption imposes that the conditional probability, as a random variable, has a well-defined density. Analogous regularity assumptions are widely employed in the literature on designing well-defined complex classification metrics and seem to be unavoidable (we refer the interested reader to Yan et al. (2018); Narasimhan et al. (2015) for details). Next, we define the general form of weighted multiclass classifiers, which are the Bayes-optimal classifiers for linear metrics.

Definition 3.2 (Narasimhan et al. (2015)). Given a loss matrix W ∈ R^{K×K}, a weighted classifier h satisfies h_i(x) > 0 only if i ∈ argmin_{k∈[K]} ⟨W_k, η(x)⟩.

Next, we present our first main result, identifying the Bayes-optimal group-fair classifier. Theorem 3.1.
Under Assumption 2.1 and Assumption 3.1, if (1) is feasible (i.e., a solution exists), the Bayes-optimal classifier is given by h*(x) = h*(z, a) = β_a h_1(x) + (1 − β_a) h_2(x), where β_a ∈ (0, 1) for all a ∈ A and h_1, h_2 are weighted classifiers with weights {{W_{i,a}}_{i∈{1,2}}}_{a∈A}.

One key observation is that, pointwise, the Bayes-optimal classifier can be decomposed based on the intersectional groups G_intersectional = A, even when G_fair is overlapping. This observation will prove useful for algorithms.

Recent research (Kearns et al., 2018) has shown, primarily via examples, how imposing overlapping group fairness using independent fairness restrictions can lead to violations of intersectional fairness. This observation led to the term fairness gerrymandering. Here, we examine this claim more formally, showing that enforcing intersectional fairness controls overlapping fairness, although the converse is not always true, i.e., enforcing overlapping fairness does not imply intersectional fairness. We show this result for the general case of quasi-convex fairness measures, with linear fairness metrics recovered as a special case.
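Theorem 3.1's form of the Bayes-optimal classifier is easy to instantiate once the per-group weight matrices and mixing coefficients are known; in the sketch below they are hypothetical inputs (in practice they would come from the dual solution):

```python
import numpy as np

def weighted_predict(eta_x, W):
    # a weighted classifier (Definition 3.2): predict argmin_k <W_k, eta(x)>
    return int(np.argmin(eta_x @ W))

def bayes_optimal_predict(eta_x, a, beta, W1, W2, rng):
    """h*(x) = beta_a h_1(x) + (1 - beta_a) h_2(x): with probability beta_a,
    follow group a's first weighted classifier, otherwise its second."""
    W = W1[a] if rng.random() < beta[a] else W2[a]
    return weighted_predict(eta_x, W)
```

With W = 11^⊤ − I both branches reduce to argmax_k η_k(x), the unconstrained Bayes classifier, so the fairness constraint enters purely through the weight matrices and β_a.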
Proposition 3.2. For any G_fair that satisfies Assumption 2.1, suppose φ : [0, 1]^{K×K} × [0, 1]^{K×K} → R_+ is quasiconvex. Then φ(C, C^g) ≤ 0 for all g ∈ G_intersectional implies φ(C, C^g) ≤ 0 for all g ∈ G_fair. The converse does not hold.
Remark 3.3. Note that the converse claim of Proposition 3.2 does not apply to G_gerrymandering. Controlling the gerrymandering fairness violation implies control of the intersectional fairness violation, since G_intersectional ⊆ G_gerrymandering.

Algorithms

Here we present
GroupFair, a general empirical procedure for solving (1). The Lagrangian of the constrained optimization problem (1) is L(h, λ) = E(h) + λ^⊤ V(h), with empirical Lagrangian L̂(h, λ) = Ê(h) + λ^⊤(V̂(h) − ε), where ε is a buffer for generalization. Our approach involves finding a saddle point of the Lagrangian. The returned classifiers are probabilistic combinations of classifiers in H, i.e., the procedure returns a classifier in conv(H). In the following, we first assume the dual parameter λ is fixed, and describe the primal solution as a classification oracle. We consider both plugin and weighted ERM oracles. In brief, the plugin estimator first proceeds assuming η(x) is known, then plugs in the empirical estimator η̂(x) in its place. The plugin approach has the benefit of low computational complexity once η̂ is fixed. On the other hand, the weighted ERM estimator requires the solution of a weighted classification problem in each round, but avoids the need for estimating η̂(x).
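Evaluating the empirical Lagrangian is then a one-liner; here `err` and `viol` stand for the scalar ψ(Ĉ[h]) and the J-vector Φ(Ĉ[h], {Ĉ^g[h]}) computed from the sample confusions:

```python
import numpy as np

def empirical_lagrangian(err, viol, lam, eps):
    # L_hat(h, lam) = E_hat(h) + lam^T (V_hat(h) - eps)
    return float(err + lam @ (viol - eps))
```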
Algorithm 1: GroupFair, group-fair classification with overlapping groups.

Input: ψ : [0, 1]^{K×K} → [0, 1], Φ : [0, 1]^{K×K} × ([0, 1]^{K×K})^{G_fair} → [0, 1]^J, samples {(x_1, y_1), ..., (x_n, y_n)}.
Initialize λ_1 ∈ [0, B]^J.
for t = 1, ..., T do
    h_t ← MinOracle_{h∈H}(L(h, λ_t), z^n)
    λ_{t+1} ← Update_t(λ_t, Φ(Ĉ[h_t], {Ĉ^g[h_t]}_{g∈G_fair}) − ε)
end
h̄_T ← (1/T) Σ_{t=1}^T h_t, λ̄_T ← (1/T) Σ_{t=1}^T λ_t
return (h̄_T, λ̄_T)

In the weighted ERM approach, we parametrize h : X → [K] by a class F of functions f : X → R^K. The classification is the argmax of the predicted vector, h(x) = argmax_j f(x)_j, so we denote the set of classifiers as H_werm = argmax ∘ F. The following special case of Definition 1 in Ramaswamy and Agarwal (2016) outlines the required conditions for weighted multiclass classification calibration. This is commonly referred to as cost-sensitive classification (Agarwal et al., 2018) when applied to binary classification.

Definition 4.1 (W-calibration (Ramaswamy and Agarwal, 2016)). Let W ∈ R_+^{K×K}. A surrogate function L : R^K → R_+^K is said to be W-calibrated if for all p ∈ Δ_K: inf_{u : argmax(u) ∉ argmin_k (p^⊤ W)_k} p^⊤ L(u) > inf_u p^⊤ L(u).

Note that the weights are sample (group) specific, which, while uncommon, is not new, e.g., Ávila Pires et al. (2013).
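The outer loop of Algorithm 1 can be sketched as follows. `min_oracle` stands in for either the plugin or the weighted ERM oracle and is assumed to return a classifier together with its empirical fairness violations; `update` is the dual step:

```python
import numpy as np

def group_fair(min_oracle, update, J, T, B=1.0, eps=0.0):
    """Primal-dual loop of Algorithm 1 (GroupFair). Returns the list of oracle
    classifiers (the averaged classifier h_bar is their uniform mixture) and the
    averaged dual variable lambda_bar."""
    lam = np.zeros(J)
    classifiers, duals = [], []
    for t in range(T):
        h_t, viol = min_oracle(lam)                      # primal step via the oracle
        lam = np.clip(update(lam, viol - eps), 0.0, B)   # dual step, kept in [0, B]^J
        classifiers.append(h_t)
        duals.append(lam.copy())
    return classifiers, np.mean(duals, axis=0)
```

With `update = lambda lam, v: lam + eta_t * v` this matches the projected-gradient (FairCOCO-style) dual step of Table 2; an exponentiated-gradient update gives the FairReduction-style step.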
Proposition 4.1. The weighted ERM estimator for the average fairness violation is given by h(x) = argmax_j f*(x)_j, f* = argmin_{f∈F} L̂(f), where L̂(f) = Ê[y^⊤ L(f)] is a multiclass classification surrogate for the weighted multiclass error with group-dependent weights, for all a ∈ A,

W(x) = D + Σ_{j=1}^J λ_j ( U_j − Σ_{g∈G_fair : a∈g} V^g_j / π̂(g) ).   (2)

The plugin hypothesis class consists of the weighted classifiers identified by Theorem 3.1, H_plg = {h(x) = argmin_{j∈[K]} (η̂(x)^⊤ B(x))_j : B(x) ∈ R^{K×K}}. Here, we focus on the average violation case only. By simply reordering terms, the population problem can be determined as follows. Proposition 4.2.
The plug-in estimator for the average fairness violation is given by ĥ(x) = argmin_{k∈[K]} (η̂(x)^⊤ W(x))_k, where W(x) is defined in (2).

GroupFair, a General Group-Fair Classification Algorithm
We can now present
GroupFair, a general algorithm for group-fair classification with overlapping groups, as outlined in Algorithm 1. Our approach proceeds in rounds, alternating between the classifier oracle and the dual variable: interleaved with the primal update is a dual update Update_t(λ, v) via gradient ascent on the dual variable. The resulting classifier is the average over the oracle classifiers. Recovery of existing methods.
When the groups are non-overlapping, GroupFair with the plugin oracle and projected gradient ascent update recovers FairCOCO (Narasimhan, 2018). Similarly, when the groups are non-overlapping and the labels are binary, GroupFair with the weighted ERM oracle and exponentiated gradient update recovers FairReduction (Agarwal et al., 2018) (see also Table 2). Importantly, GroupFair enables a straightforward extension to overlapping groups.
Table 2: Oracles and dual updates that recover existing methods. The oracles shown are plugin (6) and ERM on the reweighted L̂ (7). H = [argmax_{k∈[K]} (·)_k] converts a function X → R^K to a classifier. In FairCOCO, η̂ is estimated from samples z^{n/2} = {(x_1, y_1), ..., (x_{n/2}, y_{n/2})}, and all of the other probability estimates (π̂_g)_g and {Ĉ^g[h_t]}_g are estimated from z^n \ z^{n/2}.

  Method         MinOracle_{h∈H}(L(h, λ_t), z^n)            Update_t(λ, v)
  FairReduction  H ∘ argmin_{f∈F} L̂(f)                     λ_i ← B exp(log λ_i + η_t v_i) / (B − Σ_{j=1}^M λ_j + Σ_{j=1}^M exp(log λ_j + η_t v_j))
  FairCOCO       plugin(η̂, (π̂_g)_{g∈G_fair}, ψ, Φ, λ_t)   proj_{[0,B]^M}(λ + η_t v)

Consistency

Here we discuss the consistency of the weighted ERM and the plugin approaches. For any class H = {h : X → [K]}, denote H_k = {𝟙{h(x) = k} : h ∈ H}. We assume WLOG that VC(H_1) = ... = VC(H_K) and denote this quantity as VC(H). Next, we give a theorem relating the performance and constraint satisfaction of an empirical saddle point to an optimal fair classifier. Theorem 5.1.
Suppose ψ : [0, 1]^{K×K} → [0, 1] and Φ : [0, 1]^{K×K} × ([0, 1]^{K×K})^{G_fair} → [0, 1]^J are ρ-Lipschitz w.r.t. ‖·‖_∞. Recall L̂(h, λ) = Ê(h) + λ^⊤(V̂(h) − ε). Define γ(n′, H, δ) = √((VC(H) log(n′) + log(1/δ)) / n′). If n_min = min_{g∈G_fair} n_g and ε = Ω(ρ γ(n_min, H, δ)), then with probability 1 − δ: if (h̄, λ̄) is a ν-saddle point of max_{λ∈[0,B]^J} min_{h∈conv(H)} L̂(h, λ), in the sense that max_{λ∈[0,B]^J} L̂(h̄, λ) − min_{h∈conv(H)} L̂(h, λ̄) ≤ ν, and h* ∈ conv(H) satisfies V(h*) ≤ 0, then

E(h̄) ≤ E(h*) + ν + O(ρ γ(n, H, δ)),   ‖V(h̄)‖_∞ ≤ ν/B + O(ρ γ(n_min, H, δ)) + ε.

Thus, as long as we can find an arbitrarily good saddle point, which weighted ERM grants if H_werm is expressive enough while having finite VC dimension, we obtain consistency. A saddle point can be found by running a gradient ascent algorithm on λ confined to [0, B]^J, which repeatedly computes h_t = argmin_{h∈H} L̂(h, λ_t); the final (h̄, λ̄) are the averages of the primal and dual variables computed throughout the algorithm.

Although Theorem 5.1 captures the spirit of the argument for the plugin algorithm, it only applies naturally to the weighted ERM algorithm. This is because the plugin algorithm solves a subtly different minimization problem: it returns h_t as the population minimizer obtained when the estimated regression function η̂ replaces the true regression function. Theorem 5.2.
With probability at least 1 − δ, if projected gradient ascent is run as Update_t(λ, v) = proj_{[0,B]^J}(λ + η v) for T iterations with step size η = B/√T, and for t = 1, ..., T, h_t = plugin(η̂, (π̂_g)_{g∈G_fair}, ψ, Φ), then letting ρ = max{‖ψ‖, ‖φ_1‖, ..., ‖φ_M‖}, ρ_g = Σ_{j=1}^J ‖V^g_j‖_∞, ρ_X = ‖D‖_∞ + Σ_{j=1}^J ‖U_j‖_∞, Δ_η = E‖η(x) − η̂(x)‖_1, and ň = min_{g∈G_fair} n_g, with

κ := O( Jρ [ √((K log(ň) + log(|G_fair| K / δ)) / ň) + Δ_η ] ( ρ_X + Σ_{g∈G_fair} ρ_g/π_g ) + √(log(|G_fair|/δ)/n) Σ_{g∈G_fair} ρ_g/π_g ),

it holds that E_ψ(h̄_T) ≤ E*_ψ + JB/√T + O(BJκ) and ‖V_φ(h̄_T)‖_∞ ≤ J/√T + O(Jκ).

A key point in the presented analyses (for both procedures) is that the dominating statistical properties depend on the number of fairness groups. We note that |G_fair| ≪ |G_intersectional| = |A| for the independent case, so this significantly improves results. More broadly, we conjecture that the statistical bounds depend on min(|G_fair|, |G_intersectional|), and leave the details to future work. We also note the statistical dependence on the size of the smallest group. This seems to be unavoidable, as we need an estimate of the group fairness violation in order to control it. To this end, group violations may be scaled by group size, which leads instead to a dependence on the VC dimension of G_fair, improving the statistical dependence with small groups at the cost of some fairness (Kearns et al., 2018). We expect that the bounds may be improved by a more refined analysis, or by modified algorithms with stronger assumptions. We leave this detail to future work.
Table 3: Average training times in seconds (averaged over the training sessions for each fairness parameter). The Plugin oracle is significantly faster than the other approaches.

                 Independent                           Gerrymandering
                 C&C    Adult   German  Law school    Adult    German  Law school
  Weighted-ERM
  Plugin
  Regularizer
  Kearns et al.  N/A    N/A     N/A     N/A           2213.7   821.5   1674.4
Related Work

Recent work by Foulds et al. (2018), Kearns et al. (2018), and Hebert-Johnson et al. (2018) were among the first to define and study intersectional fairness, with respect to parity and calibration metrics respectively. Narasimhan (2018) provides a plugin algorithm for group fairness and generalization guarantees for the unrestricted case. Menon and Williamson (2018) considered Bayes optimality of fair binary classification where the sensitive attribute is unknown at test time, using an additional sensitive attribute regressor. Cotter et al. (2018) provide a proxy-Lagrangian algorithm with generalization guarantees, assuming proxy constraint functions which are strongly convex, and argue that better generalization is achieved by reserving part of the dataset for training primal parameters and part for training dual parameters. Celis et al. (2018) provide an algorithm with generalization guarantees for independent group fairness based on solving a grid of interval-constrained programs; their work and Narasimhan (2018)'s are most similar to ours.
Experiments

We consider demographic parity as the fairness violation, i.e., φ^±_DP = ±(C^g_{0,1} + C^g_{1,1} − C_{0,1} − C_{1,1}) − ν, combined with the 0-1 error ψ(C) = C_{0,1} + C_{1,0} as the error metric. All labels and protected attributes are binary or binarized. We use the following datasets (details in the appendix): (i) Communities and Crime, (ii) Adult census, (iii) German credit, and (iv) Law school. Evaluation Metric.
We compute the "fairness frontier" of each method; that is, we vary the constraint level ν and plot the fairness violation and the error rate on the training set and a test set. The fairness violation for demographic parity is defined by fairviol_DP = max_{g∈G_fair} |Ĉ^g_{0,1} + Ĉ^g_{1,1} − Ĉ_{0,1} − Ĉ_{1,1}|. Observe that on the training set, it is always possible to achieve extreme points by ignoring either the classification error or the fairness violation.
Baseline:
Regularizer is a linear classifier implemented by using Adam to minimize the logistic loss plus the following regularization function:

ρ Σ_{j=1}^M ( Σ_{i : (z_i)_j = 1} σ(w^⊤ x_i) / |{i : (z_i)_j = 1}| − Σ_{i=1}^n σ(w^⊤ x_i) / n )²,   (3)

where σ(r) = 1/(1 + e^{−r}) is the sigmoid function. This penalizes the squared differences between the average prediction probability for each group and the overall average prediction probability. Other existing methods we are aware of are either not applicable to overlapping groups or are special cases of GroupFair.
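The baseline penalty (3) is straightforward to implement; a NumPy sketch for binary group indicators Z (n × M), with w the linear model's weights:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def dp_regularizer(w, X, Z, rho=1.0):
    """Eq. (3): rho * sum_j (group-j mean of sigma(w^T x) - overall mean)^2.
    Groups with no members are skipped to avoid dividing by zero."""
    p = sigmoid(X @ w)          # per-example predicted positive probability
    overall = p.mean()
    penalty = 0.0
    for j in range(Z.shape[1]):
        mask = Z[:, j] == 1
        if mask.any():
            penalty += (p[mask].mean() - overall) ** 2
    return rho * penalty
```

In training, this term is added to the logistic loss and both are minimized jointly over w.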
Experiment 1: Independent group fairness. We consider independent group fairness, defined by considering protected attributes separately. Our results compare extensions of FairCOCO (Narasimhan, 2018) and FairReduction (Agarwal et al., 2018), existing special cases of GroupFair using the plugin and weighted ERM oracles respectively. Results are shown in Figure 1. We further present the differences in training time in Table 3. On all datasets, the variants of GroupFair are much more effective than a generic regularization approach. However, Plugin seems to violate fairness more often at test time; perhaps this is due to the ‖η̂ − η‖ term in the generalization bound in Theorem 5.2. At the same time, Plugin is almost two orders of magnitude faster, since its MinOracle essentially has a closed-form solution, while Weighted-ERM has to solve a new ERM problem in each iteration.
Experiment 2: Gerrymandering group fairness.
[Figure 1: Experiments on independent group fairness, showing the fairness frontier. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.]

[Figure 2: Experiments on gerrymandering group fairness. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.]

Unfortunately, intersectional fairness is not statistically estimable in most cases, as most intersections are empty. As a remedy, Kearns et al. (2018) propose max-violation fairness constraints over G_gerrymandering, where each group is weighted by group size, i.e., max_{g∈G_gerrymandering} (|g|/n) |Ĉ^g_{0,1} + Ĉ^g_{1,1} − Ĉ_{0,1} − Ĉ_{1,1}|, so empty groups are removed and small groups have relatively low influence unless there is a very large fairness violation. We denote the approach of Kearns et al. (2018) as Kearns et al.
This approach is closely related to Weighted-ERM, but it searches for the maximally violated group by solving a cost-sensitive classification problem and uses fictitious play between λ and h. For the Plugin and Weighted-ERM approaches, we optimize the cost function directly using gradient ascent, precomputing the gerrymandering groups present in the data. Results are shown in Figure 2. We further present the differences in training time in Table 3. The results are roughly equivalent in terms of performance; however, both the Weighted-ERM and Plugin approaches are one to two orders of magnitude faster than Kearns et al.
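The size-weighted constraint used in the gerrymandering experiments can be sketched as follows (group confusion matrices and sizes as in the earlier notation; inputs hypothetical):

```python
import numpy as np

def weighted_max_dp_violation(C, group_confusions, group_sizes, n):
    """max_g (|g|/n) * |(C^g[0,1] + C^g[1,1]) - (C[0,1] + C[1,1])|: empty groups
    contribute nothing, and small groups are down-weighted unless their
    demographic-parity gap is very large."""
    pop_rate = C[0, 1] + C[1, 1]
    worst = 0.0
    for g, C_g in group_confusions.items():
        gap = abs((C_g[0, 1] + C_g[1, 1]) - pop_rate)
        worst = max(worst, group_sizes[g] / n * gap)
    return worst
```

Note how a tiny group with a large gap can still be dominated by a large group with a moderate gap, which is exactly the intended down-weighting behavior.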
Conclusion

This manuscript considered algorithmic fairness across multiple overlapping groups simultaneously. Using a probabilistic population analysis, we present the Bayes-optimal classifier, which motivates a general-purpose algorithm, GroupFair. Our approach unifies a variety of existing group-fair classification methods and enables extensions to a wide range of non-decomposable multiclass performance metrics and fairness measures. Future work will include extensions beyond linear metrics, to consider more general fractional and convex metrics. We also wish to explore more complex prediction settings beyond classification.
References
Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69, 2018.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of some recent advances. ESAIM: PS, 9:323-375, 2005. doi: 10.1051/ps:2005018. URL https://doi.org/10.1051/ps:2005018.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77-91, New York, NY, USA, 23-24 Feb 2018. PMLR. URL http://proceedings.mlr.press/v81/buolamwini18a.html.
A P
REPRINT
Toon Calders and Sicco Verwer. Three naive bayes approaches for discrimination-free classification.
Data Min. Knowl.Discov. , 21:277–292, 09 2010. doi: 10.1007/s10618-010-0190-x.L. Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. Classification with Fairness Constraints: AMeta-Algorithm with Provable Guarantees. arXiv e-prints , art. arXiv:1806.06055, June 2018.Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv e-prints , art. arXiv:1703.00056, Feb 2017.Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, andSeungil You. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints. arXiv e-prints , art. arXiv:1807.00028, June 2018.Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml .Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness.In
Proceedings of the 3rd Innovations in Theoretical Computer Science Conference , ITCS ’12, pages 214–226,New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1115-1. doi: 10.1145/2090236.2090255. URL http://doi.acm.org/10.1145/2090236.2090255 .James Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. An Intersectional Definition of Fairness. arXive-prints , art. arXiv:1807.08362, Jul 2018.Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In
Proceedings of the30th International Conference on Neural Information Processing Systems , NIPS’16, pages 3323–3331, USA, 2016a.Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157469 .Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In
Proceedings of the30th International Conference on Neural Information Processing Systems , NIPS’16, pages 3323–3331, USA, 2016b.Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157469 .Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the(Computationally-identifiable) masses. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35thInternational Conference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research , pages1939–1948, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/hebert-johnson18a.html .Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach.pages 643–650, 12 2011. doi: 10.1109/ICDMW.2011.83.Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing andlearning for subgroup fairness. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35th InternationalConference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research , pages 2564–2572,Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kearns18a.html .Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent multilabelclassification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in NeuralInformation Processing Systems 28 , pages 3321–3329. Curran Associates, Inc., 2015.Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian,Trevor Back, Mary Chesus, Greg C Corrado, Ara Darzi, et al. International evaluation of an ai system for breastcancer screening.
Nature , 577(7788):89–94, 2020.Albert Meijer and Martijn Wessels. Predictive policing: Review of benefits and drawbacks.
International Journal ofPublic Administration , 42(12):1031–1039, 2019.Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Sorelle A. Friedler andChristo Wilson, editors,
Proceedings of the 1st Conference on Fairness, Accountability and Transparency , volume 81of
Proceedings of Machine Learning Research , pages 107–118, New York, NY, USA, 23–24 Feb 2018. PMLR. URL http://proceedings.mlr.press/v81/menon18a.html .Harikrishna Narasimhan. Learning with complex loss functions and constraints. In Amos Storkey and FernandoPerez-Cruz, editors,
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics ,volume 84 of
Proceedings of Machine Learning Research , pages 1646–1654, Playa Blanca, Lanzarote, CanaryIslands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/narasimhan18a.html .9airness with Overlapping Groups
A P
REPRINT
Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithmsfor complex performance measures. In
Proceedings of the 32nd International Conference on Machine Learning(ICML-15) , pages 2398–2407, 2015.Harish G Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices.
The Journal ofMachine Learning Research , 17(1):397–441, 2016.Chih-Fong Tsai and Jhen-Wei Wu. Using neural network ensembles for bankruptcy prediction and credit scoring.
Expert systems with applications , 34(4):2639–2649, 2008.Xiaoyan Wang, Ran Li, Bowei Yan, and Oluwasanmi Koyejo. Consistent classification with generalized metrics. arXivpreprint arXiv:1908.09057 , 2019.Bowei Yan, Sanmi Koyejo, Kai Zhong, and Pradeep Ravikumar. Binary classification with karmic, threshold-quasi-concave metrics. In
Proceedings of the 35th International Conference on Machine Learning , volume 80, pages5531–5540. PMLR, 2018.Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In SanjoyDasgupta and David McAllester, editors,
Proceedings of the 30th International Conference on Machine Learning ,volume 28 of
Proceedings of Machine Learning Research , pages 325–333, Atlanta, Georgia, USA, 17–19 Jun 2013.PMLR. URL http://proceedings.mlr.press/v28/zemel13.html .Bernardo Ávila Pires, Csaba Szepesvari, and Mohammad Ghavamzadeh. Cost-sensitive multiclass classification riskbounds. In Sanjoy Dasgupta and David McAllester, editors,
Proceedings of the 30th International Conference onMachine Learning , volume 28 of
Proceedings of Machine Learning Research , pages 1391–1399, Atlanta, Georgia,USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/avilapires13.html .10airness with Overlapping Groups
A P
REPRINT
Appendix

A Bayes optimal
Theorem 3.1.
Under Assumption 2.1 and Assumption 3.1, if (1), i.e., $h^* \in \operatorname{argmin}_h \mathcal{E}(h)$ s.t. $\mathcal{V}(h) \le 0$, is feasible (i.e., a solution exists), the Bayes-optimal classifier is given by
$$h^*(x) = h^*(z, a) = \beta_a h_1(x) + (1 - \beta_a) h_2(x),$$
where $\beta_a \in (0, 1)\ \forall a \in \mathcal{A}$ and the $h_i(x)$ are weighted classifiers with weights $\{\{W_{i,a}\}_{i \in \{1,2\}}\}_{a \in \mathcal{A}}$.

Proof.
The key idea of the proof is to exploit the problem representation in terms of confusion matrices. The proof has two main steps: (i) a population analysis of feasible confusion matrices, and (ii) plug-in of the classifiers that achieve the Bayes-optimal confusion.
Confusion space.
As the first step, let $\mathcal{C}_g = \{C^g(h) \mid h \in \mathcal{H}\}$ be the set of all group-$g$ specific confusion matrices, and let $\mathcal{C}_{G_{\mathrm{fair}}} = \prod_{g \in G_{\mathrm{fair}}} \mathcal{C}_g$ be the product space of all confusion matrices corresponding to fair groups associated with a given instance of the problem. Similarly, let $\mathcal{C}_{\mathcal{A}} = \prod_{g \in G_{\mathrm{intersectional}}} \mathcal{C}_g$ be the product space of all confusion matrices corresponding to intersectional groups. A standard property of confusion matrices is that each $\mathcal{C}_g$ is a convex set (Narasimhan et al., 2015; Narasimhan, 2018; Wang et al., 2019). Thus, each $C \in \mathcal{C}_g$ can be described as a mixture of two boundary points:
$$\forall C \in \mathcal{C}_g\ \ \exists C_1, C_2 \in \partial\mathcal{C}_g,\ \beta \in [0,1],\ \text{s.t.}\ C = \beta C_1 + (1 - \beta) C_2.$$
Another useful fact is that all confusion matrices on the boundary can be achieved by a weighted classifier (Narasimhan et al., 2015; Narasimhan, 2018; Wang et al., 2019). This fact follows from the convexity of the set $\mathcal{C}_g$, and is simply a dual representation via support functions:
$$\forall C \in \partial\mathcal{C}_g\ \ \exists W\ \text{s.t.}\ C = \mathrm{Conf}_g(h^*), \quad \text{where } h^* \in \operatorname*{argmax}_{h \in \mathcal{H}} \langle W, \mathrm{Conf}_g(h)\rangle,$$
and where, for notational clarity, $\mathrm{Conf}(h)$ is the confusion matrix of classifier $h$, and $\mathrm{Conf}_g(h)$ is the group-restricted confusion matrix. Further, the solution $h^*$ can be represented as a weighted classifier (Definition 3.2) (Narasimhan, 2018; Wang et al., 2019).

Population confusion problem.
Recall that the population confusion can be decomposed into its intersectional counterparts, $C = \sum_{a \in G_{\mathrm{intersectional}}} P(a)\, C^a$. Similarly, each overlapping group confusion can be decomposed using the intersectional confusions: for $C^g \in \mathcal{C}_{G_{\mathrm{fair}}}$, $C^g = \sum_{a \in G_{\mathrm{intersectional}}} P(a \mid g)\, C^a$.

As the overall metric is a function of confusion matrices only, we can re-state (1) as the equivalent confusion problem (with slight abuse of notation) for any $G_{\mathrm{fair}}$:
$$C^*, \{C^{g,*}\} = \operatorname{argmin}\ \psi(C)\ \ \text{s.t.}\ \ \Phi(C, \{C^g\}) \le 0, \quad C = \sum_{a \in G_{\mathrm{intersectional}}} P(a)\, C^a, \quad C^g = \sum_{a \in G_{\mathrm{intersectional}}} P(a \mid g)\, C^a, \quad C^a = \mathrm{Conf}_a(h).$$
After substituting the population confusion $C$ and the group confusions $C^g$ with the presented linear functions of the $C^a$, this is equivalent to the problem
$$\{C^{a,*}\} = \operatorname{argmin}\ \psi(\{C^a\})\ \ \text{s.t.}\ \ \Phi(\{C^a\}) \le 0, \quad C^a = \mathrm{Conf}_a(h).$$
Here, we have used the linearity of the cost functions $\psi$ and $\Phi$, and the linearity of the confusion matrix decompositions into intersectional confusion matrices.

Putting it together.
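The intersectional decomposition of the population confusion matrix used in this step can be checked numerically. Below is a minimal sketch with synthetic labels, predictions, and intersectional group assignments (all hypothetical): it verifies that the population confusion equals the probability-weighted sum of the per-group confusions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, n_groups = 1000, 3, 4
a = rng.integers(0, n_groups, size=n)   # intersectional group of each sample
y = rng.integers(0, K, size=n)          # true labels
h = rng.integers(0, K, size=n)          # predicted labels

def conf(mask):
    """Empirical confusion matrix restricted to the samples in `mask`."""
    C = np.zeros((K, K))
    for i, j in zip(y[mask], h[mask]):
        C[i, j] += 1
    return C / mask.sum()

C_pop = conf(np.ones(n, dtype=bool))
# C = sum_a P(a) C^a, with P(a) and C^a both estimated from the sample
C_mix = sum((a == g).mean() * conf(a == g) for g in range(n_groups))
print(np.allclose(C_pop, C_mix))  # True
```

The identity is exact on any sample because the group weights and group confusions use the same empirical counts; the same bookkeeping gives $C^g = \sum_a P(a \mid g) C^a$ for overlapping groups.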
The final step is noting that a solution, if it exists, can be represented by feasible intersectional confusion matrices $\{C^{a,*}\}$, and in turn, each intersectional confusion matrix can be recovered as a weighted average of two intersectional boundary confusion matrices. Thus the corresponding classifiers can be recovered as a mixture of two weighted classifiers.
B Independent vs. intersectional group fairness
Proposition 3.2.
For any $G_{\mathrm{fair}}$ that satisfies Assumption 2.1, suppose $\varphi : [0,1]^{K \times K} \times [0,1]^{K \times K} \to \mathbb{R}_+$ is quasiconvex in its second argument. Then
$$\varphi(C, C^g) \le \nu\ \ \forall g \in G_{\mathrm{intersectional}} \implies \varphi(C, C^g) \le \nu\ \ \forall g \in G_{\mathrm{fair}}.$$
The converse does not hold.
Proof. (For the forward direction.) Recall that $f$ is quasiconvex if $f(\sum_i \lambda_i z_i) \le \max_i \{f(z_i)\}$ for any convex combination. When $\varphi$ is quasiconvex in its second argument, for any $G_{\mathrm{fair}}$ we can compute
$$\varphi(C, C^g) = \varphi\Big(C, \sum_{a \in G_{\mathrm{intersectional}}} \lambda_a C^a\Big) \le \max_{a \in G_{\mathrm{intersectional}}} \varphi(C, C^a),$$
where the $\lambda_a$ are convex weights (corresponding to inclusion probabilities). Since $\varphi(C, C^a) \le \nu$ by the claim, it follows that $\varphi(C, C^a) \le \nu\ \forall a \in G_{\mathrm{intersectional}} \implies \varphi(C, C^g) \le \nu\ \forall g \in G_{\mathrm{fair}}$.

Converse.
Though the above applies to any quasiconvex metric, in this manuscript we mainly consider linear metrics. As a corollary, intersectional group fairness with respect to common fairness metrics such as demographic parity or equal opportunity implies independent group fairness. A simple xor-like example from Kearns et al. (2018) shows that the converse is not true.

We provide another counterexample to the converse, showing a gap between independent and intersectional demographic parity (DP) group fairness, on an example with more realistic structure.
Example B.1.
Let $A_1, A_2, A_3$ be binary attributes and let $\{A_m\}$ denote the event $\{A_m = 1\}$. If $P(Y) = P(A_1) = P(A_2) = P(A_3) = 0.5$; $A_1, A_2, A_3$ are both independent and conditionally independent given $Y$; and $P(A_m \mid Y) = 0.6$, then for every $P, N \subset \{1, 2, 3\}$ with $P \cap N = \emptyset$,
$$P(Y \mid \cap_{i \in P} A_i,\ \cap_{j \in N} \bar A_j) = 0.5 \cdot (1.2)^{|P|} \cdot (0.8)^{|N|}.$$

Proposition B.1.
An optimal (DP) intersectionally fair $\hat Y$ has, over every possible subgroup $G = \cap_{i \in P} A_i \cap_{j \in N} \bar A_j$, $P(\hat Y \mid G) = 0.384 = 0.5 \cdot 1.2 \cdot (0.8)^2$, and has an error of $0.148$. On the other hand, an optimal (DP) independently fair classifier has $P(\hat Y \mid A_1, A_2, A_3) = 0.464$, $P(\hat Y \mid A_i, A_j, \bar A_k) = 0.576$, $P(\hat Y \mid A_i, \bar A_j, \bar A_k) = 0.384$, and $P(\hat Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.656$, and has an error of $0.1$. Interestingly, even though $P(Y \mid A_1, A_2, A_3) = 0.864$ and $P(Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.256$ are the highest and lowest conditional probabilities, the reverse is true of the predictor $\hat Y$: it sacrifices accuracy on these groups to obtain higher accuracy on mixed positive/complement intersections.

Here we set up and discuss the example in Section 3.2 in more detail. First we begin with a rigorous and more general description of the structure of the example; here, one can think of a binary attribute as synonymous with a two-block partition: the first block corresponds to individuals with a value of 1 for that attribute and the other block to those with a value of 0.

Assumption B.2 (Independence). Assume that the binary attributes $A_1, A_2, \dots, A_M$ and label $Y$ satisfy:
1. $A_1, \dots, A_M$ are independent.
2. $A_1, \dots, A_M$ are independent conditioned on $Y$.

In the following, when $A_j$ is used to denote an event inside a probability, it refers to the event $\{A_j = 1\}$; $\bar A_j$ refers to the event $\{A_j = 0\}$. We also use the notation $A_j^1 = A_j$ and $A_j^0 = \bar A_j$.

Proposition B.2.
For every $j = 1, \dots, M$, define $q_j = P(A_j \mid Y)$ and $a_j = P(A_j)$. Then, under Assumption B.2, for any index set $J = \{j_1, j_2, \dots, j_{|J|}\}$ and $(b_k)_{k=1}^{|J|} \in \{0,1\}^{|J|}$,
$$P(Y \mid A_{j_k}^{b_k},\ k = 1, \dots, |J|) = P(Y) \prod_{k=1}^{|J|} \left(\frac{q_{j_k}}{a_{j_k}}\right)^{b_k} \left(\frac{1 - q_{j_k}}{1 - a_{j_k}}\right)^{1 - b_k}.$$
Proof.
$$P(Y \mid A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}}) = \frac{P(Y, A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}})}{P(A_{j_1}^{b_1}, \dots, A_{j_{|J|}}^{b_{|J|}})} = P(Y) \prod_{k=1}^{|J|} \frac{P(A_{j_k}^{b_k} \mid Y, A_{j_1}^{b_1}, \dots, A_{j_{k-1}}^{b_{k-1}})}{P(A_{j_k}^{b_k} \mid A_{j_1}^{b_1}, \dots, A_{j_{k-1}}^{b_{k-1}})} = P(Y) \prod_{k=1}^{|J|} \frac{P(A_{j_k}^{b_k} \mid Y)}{P(A_{j_k}^{b_k})} = P(Y) \prod_{k=1}^{|J|} \left(\frac{q_{j_k}}{a_{j_k}}\right)^{b_k} \left(\frac{1 - q_{j_k}}{1 - a_{j_k}}\right)^{1 - b_k}.$$
The third equality follows by independence, Assumption B.2.

The idea behind the above proposition is that, with the independence Assumption B.2, the structure of $P(Y \mid A_1^{b_1}, \dots, A_M^{b_M})$ is such that $P(Y)$ is scaled either by $q_j/a_j$ or by $(1 - q_j)/(1 - a_j)$, depending on whether we are in $A_j$ or $\bar A_j$. This in a sense makes the effects of protected attributes "pile on." If we assume WLOG that $q_j/a_j \ge 1$, then $(1 - q_j)/(1 - a_j) \le 1$.

Example B.3.
Suppose that $M = 3$, $P(Y) = 0.5$, and for every $j = 1, 2, 3$, $a_j = P(A_j) = 0.5$ and $q_j = P(A_j \mid Y) = 0.6$. (This is possible because for every $J$, $0 \le P(Y \mid A_j, j \in J) \le 1$, i.e., it is a well-defined probability.) Applying Proposition B.2, and noting $\frac{q_j}{a_j} = 1.2$ and $\frac{1 - q_j}{1 - a_j} = 0.8$:
$$P(Y \mid A_1) = P(Y \mid A_2) = P(Y \mid A_3) = 0.5 \cdot 1.2 = 0.6,$$
$$P(Y \mid \bar A_1) = P(Y \mid \bar A_2) = P(Y \mid \bar A_3) = 0.5 \cdot 0.8 = 0.4,$$
$$P(Y \mid A_j, A_k) = 0.5 \cdot (1.2)^2 = 0.72 \quad \forall\, 1 \le j < k \le 3,$$
$$P(Y \mid A_j, \bar A_k) = 0.5 \cdot 1.2 \cdot 0.8 = 0.48 \quad \forall\, 1 \le j \ne k \le 3,$$
$$P(Y \mid \bar A_j, \bar A_k) = 0.5 \cdot (0.8)^2 = 0.32 \quad \forall\, 1 \le j < k \le 3,$$
$$P(Y \mid A_1, A_2, A_3) = 0.5 \cdot (1.2)^3 = 0.864,$$
$$P(Y \mid A_i, A_j, \bar A_k) = 0.5 \cdot (1.2)^2 \cdot 0.8 = 0.576 \quad \forall\ \text{distinct } i, j, k,$$
$$P(Y \mid A_i, \bar A_j, \bar A_k) = 0.5 \cdot 1.2 \cdot (0.8)^2 = 0.384 \quad \forall\ \text{distinct } i, j, k,$$
$$P(Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.5 \cdot (0.8)^3 = 0.256.$$

Fact B.4.
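These conditional probabilities can be checked mechanically. The sketch below assumes the example's parameters are $P(Y) = 0.5$, $a_j = 0.5$, $q_j = 0.6$ (the values consistent with the products shown), computes $P(Y, \text{bits})$ from conditional independence given $Y$ and $P(\text{bits})$ from marginal independence, and recovers $P(Y \mid \text{bits})$ by Bayes' rule:

```python
pY, a, q = 0.5, 0.5, 0.6  # assumed parameters of Example B.3

def p_y_given(bits):
    """P(Y=1 | A_j = bits_j), using both independence assumptions."""
    p_y_bits = pY   # P(Y, bits): attributes independent given Y
    p_bits = 1.0    # P(bits): attributes marginally independent
    for b in bits:
        p_y_bits *= q if b else 1 - q
        p_bits *= a if b else 1 - a
    assert p_bits - p_y_bits >= 0  # the joint is a well-defined probability
    return p_y_bits / p_bits

print(round(p_y_given((1, 1, 1)), 3))  # 0.864 = 0.5 * 1.2^3
print(round(p_y_given((1, 1, 0)), 3))  # 0.576
print(round(p_y_given((1, 0, 0)), 3))  # 0.384
print(round(p_y_given((0, 0, 0)), 3))  # 0.256
```

Note the two independence conditions constrain different marginals of the joint, so the code only needs $P(Y,\text{bits})$ and $P(\text{bits})$; the internal assertion is exactly the "well-defined probability" caveat in the example.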
Assuming Assumption B.2 and the accuracy metric, the optimal intersectionally fair predictor $\hat Y$ assigns the probabilities
$$\forall\, b \in \{0,1\}^M, \quad P(\hat Y \mid A_1^{b_1}, \dots, A_M^{b_M}) = \operatorname{wmedian}_A\ P(Y) \prod_{j=1}^M \left(\frac{q_j}{a_j}\right)^{b_j} \left(\frac{1 - q_j}{1 - a_j}\right)^{1 - b_j},$$
where the weighted median $\operatorname{wmedian}_A$ of the $2^M$ numbers $\{r_{b^1} \le \dots \le r_{b^{2^M}} : b^i \in \{0,1\}^M\}$ is $r_{b^{i^*}}$, with
$$i^* = \min\Big\{i \in \mathbb{N} : \sum_{k \le i} P(A_1^{b^k_1}, \dots, A_M^{b^k_M}) \ge 0.5\Big\}.$$

(Proof sketch.) Taking the subgradient of $\mathbb{E}|Y - \hat Y|$: since we have the freedom to pick any single constant to assign to every $P(\hat Y \mid A_1^{b_1}, \dots, A_M^{b_M})$, we get the weighted median formula.

Fact B.5.
In Example B.3, using Fact B.4, (an) optimal intersectionally fair predictor assigns $P(\hat Y \mid A_1^{b_1}, A_2^{b_2}, A_3^{b_3}) = 0.384$ and has an error of
$$\frac{1}{8}\left(|0.864 - 0.384| + 3 \cdot |0.576 - 0.384| + |0.256 - 0.384|\right) = 0.148.$$
On the other hand, an optimal independently group fair predictor assigns
$$P(\hat Y \mid A_1, A_2, A_3) = 0.464, \quad P(\hat Y \mid A_i, A_j, \bar A_k) = 0.576, \quad P(\hat Y \mid A_i, \bar A_j, \bar A_k) = 0.384, \quad P(\hat Y \mid \bar A_1, \bar A_2, \bar A_3) = 0.656.$$
This predictor has an error of $\frac{1}{8}\left(|0.864 - 0.464| + |0.256 - 0.656|\right) = 0.1$. This is strictly less than the optimal intersectional error $0.148$, i.e., there is a gap.

Proof.
By basically the same argument as for the intersectional case, the optimal value of $P(\hat Y \mid A_i) = P(\hat Y \mid \bar A_i)$ is a weighted median of the relevant conditional probabilities; we now verify that $\hat Y$ as defined above is independently group fair:
$$P(\hat Y \mid A_i) = \frac{1}{4}\left(P(\hat Y \mid A_i, A_j, A_k) + P(\hat Y \mid A_i, \bar A_j, A_k) + P(\hat Y \mid A_i, A_j, \bar A_k) + P(\hat Y \mid A_i, \bar A_j, \bar A_k)\right) = \frac{1}{4}\left(0.464 + 2(0.576) + 0.384\right) = 0.5,$$
$$P(\hat Y \mid \bar A_i) = \frac{1}{4}\left(P(\hat Y \mid \bar A_i, A_j, A_k) + P(\hat Y \mid \bar A_i, \bar A_j, A_k) + P(\hat Y \mid \bar A_i, A_j, \bar A_k) + P(\hat Y \mid \bar A_i, \bar A_j, \bar A_k)\right) = \frac{1}{4}\left(0.576 + 2(0.384) + 0.656\right) = 0.5.$$
Since $i \in \{1, 2, 3\}$ is arbitrary, independent group fairness is satisfied.

C Consistency and Generalization
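The gap in this example can be verified end to end. The sketch below assumes the reconstructed values of the two predictors (constant 0.384 for the intersectionally fair one; 0.464/0.576/0.384/0.656 by number of positive attributes for the independently fair one), checks demographic parity for every single attribute, and compares the errors:

```python
from itertools import product

# Bayes conditionals P(Y | cell), with P(Y)=0.5, q/a=1.2, (1-q)/(1-a)=0.8
bayes = {b: 0.5 * 1.2 ** sum(b) * 0.8 ** (3 - sum(b))
         for b in product((0, 1), repeat=3)}
inter = {b: 0.384 for b in bayes}  # intersectionally fair: one constant
indep = {b: {3: 0.464, 2: 0.576, 1: 0.384, 0: 0.656}[sum(b)] for b in bayes}

def error(pred):
    # each of the 8 cells has probability 1/8 in the example
    return sum(abs(bayes[b] - pred[b]) for b in bayes) / 8

def group_rate(pred, i, v):
    # P(Yhat | A_i = v); the other two attributes are uniform
    cells = [b for b in pred if b[i] == v]
    return sum(pred[b] for b in cells) / len(cells)

for i in range(3):  # independent DP: P(Yhat|A_i) = P(Yhat|not A_i) = 0.5
    assert abs(group_rate(indep, i, 1) - group_rate(indep, i, 0)) < 1e-12
print(round(error(inter), 3), round(error(indep), 3))  # 0.148 0.1
```

The independently fair predictor is exactly fair per attribute yet strictly more accurate, which is the claimed gap between the two fairness notions.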
Theorem 5.2.
With probability at least $1 - \delta$, if projected gradient ascent ($\mathrm{Update}_t(\lambda, v) = \mathrm{proj}_{[0,B]^J}(\lambda + \eta v)$) is run for $T$ iterations with step size $\eta = B/\sqrt{T}$, and for $t = 1, \dots, T$, $h_t = \mathrm{plugin}(\hat\eta, (\hat\pi_g)_{g \in G_{\mathrm{fair}}}, \psi, \Phi)$, then, letting $\rho = \max\{\|\psi\|_\infty, \|\phi_1\|_\infty, \dots, \|\phi_J\|_\infty\}$,
$$\mathcal{U}_\psi(\bar h_T) \le \mathcal{U}^*_\psi + \frac{JB}{\sqrt{T}} + ((1 + J)B + 1)\,\rho\left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1\, B\left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) + 2\sqrt{\frac{\log(|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g},$$
$$\|\mathcal{V}_\Phi(\bar h_T)\|_\infty \le \frac{J}{\sqrt{T}} + 4(4(1 + J) + 1)\,\rho\left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + 4\,\mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 \left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) + 8\sqrt{\frac{\log(|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}.$$
The first step is to extract the error incurred by plugging in $\hat\eta$ rather than $\eta$. Denoting $\hat h = \mathrm{plugin}(\hat\eta, (\hat\pi_g)_g, \psi, \Phi, \lambda)$ and $n_g = |\{i : x_i \in g\}|$, so that $\hat\pi_g = \frac{n_g}{n}$,
$$\hat h(x) = \operatorname*{argmin}_{k \in \{1, \dots, K\}} \left\{\hat\eta(x)^\top \left[D + \sum_{l=1}^J \lambda_l \Big(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x \in g} \frac{1}{\hat\pi_g} V^g_l\Big)\right]\right\}_k.$$
Denote $h = \mathrm{plugin}(\eta, (\pi_g)_g, \psi, \Phi, \lambda)$. We quantify the discrepancy. Define $\hat k = \hat h(x)$ and $k^* = h(x)$. Also, define
$$M = D + \sum_{l=1}^J \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x \in g} \frac{1}{\hat\pi_g} V^g_l\right).$$
$$\begin{aligned}
(\eta(x)^\top M)_{\hat k} - (\eta(x)^\top M)_{k^*} &= (\hat\eta(x)^\top M)_{\hat k} + [(\eta(x) - \hat\eta(x))^\top M]_{\hat k} - (\eta(x)^\top M)_{k^*} \\
&\le (\hat\eta(x)^\top M)_{k^*} + [(\eta(x) - \hat\eta(x))^\top M]_{\hat k} - (\eta(x)^\top M)_{k^*} + \xi \\
&= (\eta - \hat\eta)^\top M (e_{\hat k} - e_{k^*}) + \xi \\
&\le \|\eta - \hat\eta\|_1 \left(\sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g} + \rho_X\right) B + \xi,
\end{aligned}$$
where $\rho_g = \sum_{l=1}^J \|V^g_l\|_\infty$, $\rho_X = \|D\|_\infty + \sum_{l=1}^J \|U_l\|_\infty$, and $\xi = 2\sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g}$; here we use the fact that $|\pi_g - \hat\pi_g| \le \sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}}$ for every $g \in G_{\mathrm{fair}}$ with probability $1 - \delta/2$. Taking expectations, we arrive at
$$L(C(\hat h), \lambda) - L(C(h), \lambda) \le \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 \left(\sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g} + \rho_X\right) B + 2\sqrt{\frac{\log(2|G_{\mathrm{fair}}|/\delta)}{n}} \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g B}{\pi_g}. \quad (4)$$
By a standard subgradient ascent / online learning analysis, with step size $\eta = B/\sqrt{T}$,
$$\max_{\lambda \in [0,B]^J} \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda) - \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda_t) \le \frac{JB}{\sqrt{T}},$$
because $L(h, \cdot)$ is concave and $\sqrt{J}$-Lipschitz (all fairness violations are assumed to lie in $[0,1]$), and the $\ell_2$ radius of $[0,B]^J$ is $\sqrt{J} B$.

Now we show how good a saddle point $\big(\frac{1}{T}\sum_{t=1}^T h_t, \frac{1}{T}\sum_{t=1}^T \lambda_t\big) =: (\bar h_T, \bar\lambda_T)$ is for the population problem. By convexity of $L$ in the first argument,
$$\max_{\lambda \in [0,B]^J} \frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda) \ge \max_{\lambda \in [0,B]^J} \hat L(\bar h_T, \lambda).$$
Using equation (4) and the fact that $h_t$ is the minimizer of $L(C[h], \lambda_t)$, but computed with $\hat\eta$ instead of $\eta$,
$$\frac{1}{T} \sum_{t=1}^T \hat L(h_t, \lambda_t) \le \min_{h : X \to [0,1]^K} L(h, \bar\lambda_T) + 4(1 + J) B \rho \left(\sqrt{\frac{K^2 \log(2 n_{\min})}{n_{\min}}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_{\min}}}\right) + B\left(\rho_X + \sum_{g \in G_{\mathrm{fair}}} \frac{\rho_g}{\pi_g}\right) \mathbb{E}\|\eta(x) - \hat\eta(x)\|_1 + \xi,$$
where the middle term comes from Lemma D.1. Absorbing the error terms into $\gamma$, we can write
$$\max_{\lambda \in [0,B]^J} \hat L(\bar h_T, \lambda) - \min_{h : X \to [0,1]^K} L(h, \bar\lambda_T) \le \frac{JB}{\sqrt{T}} + \gamma.$$
Letting $(h^*, \lambda^*)$ be primal-dual optimal, we have
$$\forall \lambda \in [0,B]^J, \quad L(h^*, \lambda^*) \ge \hat L(\bar h_T, \lambda) - \frac{JB}{\sqrt{T}} - \gamma. \quad (5)$$
The choices $\lambda = 0$ and $\lambda = \lambda^* + B e_k$ give
$$\hat{\mathcal{U}}(\bar h_T) \le \mathcal{U}(h^*) + \gamma + \frac{JB}{\sqrt{T}}, \qquad \hat{\mathcal{V}}(\bar h_T)_k \le \frac{1}{B}\left(\frac{JB}{\sqrt{T}} + 2\gamma\right).$$
By Lemma D.1, for all $g \in G_{\mathrm{fair}}$,
$$\sup_{h \in \mathcal{H}_{\mathrm{plg}}} \|C^g[h] - \hat C^g[h]\|_\infty \le \sqrt{\frac{K^2 \log(2 n_g)}{n_g}} + \sqrt{\frac{\log(2(1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_g}} =: \zeta(n_g),$$
so with probability at least $1 - \delta$,
$$\mathcal{U}(\bar h_T) \le \mathcal{U}(h^*) + \gamma + \frac{JB}{\sqrt{T}} + \rho\,\zeta(n_{\min}), \qquad \mathcal{V}(\bar h_T)_k \le \frac{1}{B}\left(\frac{JB}{\sqrt{T}} + 2\gamma\right) + \rho\,\zeta(n_{\min}).$$
Substituting the expression for $\gamma$ yields the bounds in the theorem statement.

D Estimators
In this section, we give plugin and weighted ERM methods for solving the linear probabilistic minimization problems arising from the Lagrangian of our fairness problem. For clarity, we go over the choices of cost and constraint matrices corresponding to what we use in our experiments.

In our experiments, we maximize accuracy while enforcing independent demographic parity constraints and group-weighted gerrymandering demographic parity constraints. Under the framework of our probabilistic optimization problem, the former corresponds to the choice $G_{\mathrm{fair}} = G_{\mathrm{independent}}$, and $\Phi$ containing the $4M$ constraints (a $\pm$ pair for each group)
$$\forall g \in G_{\mathrm{independent}}, \quad \pm\left(C^g_{+,1} - C_{+,1}\right) \le \nu,$$
where the $+$ subscript denotes summing over the indices $0, 1$ in its place. That is, for $g \in G_{\mathrm{independent}}$,
$$V^{gg,\pm} = \pm\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad V^{g'g,\pm} = 0 \ \text{ for } g' \ne g, \quad U^{g,\pm} = \pm\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
The latter corresponds to the choice $G_{\mathrm{fair}} = G_{\mathrm{gerrymandering}}$, and the $|G_{\mathrm{gerrymandering}}|$ pairs of constraints
$$\forall g \in G_{\mathrm{gerrymandering}}, \quad \pm P(g)\left(C^g_{+,1} - C_{+,1}\right) \le \nu.$$
This corresponds to, for $g \in G_{\mathrm{gerrymandering}}$,
$$V^{gg,\pm} = \pm P(g)\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad V^{g'g,\pm} = 0 \ \text{ for } g' \ne g, \quad U^{g,\pm} = \pm P(g)\begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$
The $P(g)$'s cancel out with the $\frac{1}{P(g)}$'s in the expressions below.
D.1 Plugin Estimator
Using linearity of $\psi$ and $\phi$, if $\eta$ is known, the population minimizer $h^* = \operatorname{argmin}_{h : X \to [K]} L(h, \lambda)$ is deterministic and has a convenient closed-form solution (the same is true of any linear minimization):
$$L(h, \lambda) = \left\langle D + \sum_{l=1}^L \lambda_l U_l,\ C[h] \right\rangle - \sum_{g \in G_{\mathrm{fair}}} \sum_{l=1}^L \lambda_l \left\langle V^g_l,\ C^g[h] \right\rangle = \mathbb{E}\left\{\left\langle D + \sum_{l=1}^L \lambda_l U_l,\ \eta(x)\, e_{h(x)}^\top \right\rangle - \sum_{g \in G_{\mathrm{fair}}} \sum_{l=1}^L \lambda_l \left\langle V^g_l,\ \frac{\mathbb{1}\{x \in g\}}{P(g)}\, \eta(x)\, e_{h(x)}^\top \right\rangle\right\} = \mathbb{E}\, \eta(x)^\top \left[D + \sum_{l=1}^L \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}}} \frac{\mathbb{1}\{x \in g\}}{P(g)} V^g_l\right)\right] e_{h(x)},$$
where we used that the conditional group confusion equals $C^g[h] = \mathbb{E}\big[\mathbb{1}\{x \in g\}\, \eta(x)\, e_{h(x)}^\top\big] / P(g)$. Denote $\pi_g = P(g)$ for $g \in G_{\mathrm{fair}}$ as the group probabilities. Thus, the minimizer has the deterministic form
$$h^*(x) = \operatorname*{argmin}_{k \in \{1, \dots, K\}} \left\{\eta(x)^\top \left[D + \sum_{l=1}^L \lambda_l \left(U_l - \sum_{g \in G_{\mathrm{fair}}} \frac{\mathbb{1}\{x \in g\}}{\pi_g} V^g_l\right)\right]\right\}_k. \quad (6)$$
Finally, since we do not actually have access to the true $\eta$, we replace $\eta$ with an estimate $\hat\eta$.
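The plugin rule is a per-example argmin over classes; a minimal sketch follows. The matrix names mirror the text, the toy inputs are hypothetical, and `eta_x` stands for the estimated class-probability vector $\hat\eta(x)$:

```python
import numpy as np

def plugin_classify(eta_x, in_group, pi, D, U, V, lam):
    """Plugin decision: argmin_k { eta(x)^T [D + sum_l lam_l (U_l - sum_{g: x in g} V^g_l / pi_g)] }_k.

    eta_x: (K,) estimated class probabilities for this x
    in_group: dict g -> bool (does x belong to group g); pi: dict g -> P(g)
    D: (K, K) cost matrix; U: list of (K, K); V: dict g -> list of (K, K)
    lam: sequence of Lagrange multipliers, one per constraint
    """
    M = D.copy()
    for l, lam_l in enumerate(lam):
        M = M + lam_l * U[l]
        for g, member in in_group.items():
            if member:
                M = M - lam_l * V[g][l] / pi[g]
    scores = eta_x @ M          # scores[k] = eta(x)^T M e_k
    return int(np.argmin(scores))

D = np.array([[0., 1.], [1., 0.]])  # 0-1 misclassification cost
k = plugin_classify(np.array([0.3, 0.7]), {}, {}, D, [], {}, [])
print(k)  # 1: with no active constraints, picks the class with larger eta
```

With all multipliers at zero the rule reduces to the Bayes classifier for the cost matrix $D$; the group terms shift the per-class costs only for the groups containing $x$, exactly as in (6).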
In the weighted ERM approach (referred to as cost-sensitive classification in the binary case (Agarwal et al., 2018)), we parametrize $h : X \to [K]$ by a function class $\mathcal{F}$ of functions $f : X \to \mathbb{R}^K$. The classification is the argmax of the predicted vector, $h(x) = \operatorname{argmax}_j f(x)_j$, so we denote the set of classifiers as $\mathcal{H}_{\mathrm{werm}} = \operatorname{argmax} \circ \mathcal{F}$. For a standard classification problem with 0-1 error, minimizing the dataset error $\widehat{\mathrm{err}}[h] = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{h(x_i) \ne y_i\}$ is done by minimizing a surrogate loss $\ell : \mathbb{R}^K \times [K] \to \mathbb{R}_+$ (e.g., softmax cross-entropy) over the dataset, $\hat{\mathbb{E}}\, \ell(f(x), y) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$. Then we take $h = \operatorname{argmax} \circ f$. Let $\ell(s) \in \mathbb{R}^K$ denote the vector with $\ell(s)_k = \ell(s, k)$.

In an analogous manner, we would like to minimize the empirical metric defined by the Lagrangian using a surrogate loss, as
$$\min_{h \in \mathcal{H}_{\mathrm{werm}}} \hat L(h, \lambda) = \sum_{i=1}^n e_{y_i}^\top \left[\frac{1}{n} D + \sum_{l=1}^L \frac{\lambda_l}{n} \left(U_l - \sum_{g \in G_{\mathrm{fair}} :\, x_i \in g} \frac{n}{n_g} V^g_l\right)\right] e_{h(x_i)},$$
where $n_g = |\{i : x_i \in g\}|$, $g \in G_{\mathrm{fair}}$, are the empirical sizes of each group. Notice it has the form
$$\min_{h \in \mathcal{H}_{\mathrm{werm}}} \sum_{i=1}^n w_i^\top e_{h(x_i)} = \sum_{i=1}^n s(w_i)\, \frac{w_i^\top}{s(w_i)}\, e_{h(x_i)}, \qquad s(w_i) = \frac{1}{K - 1} \sum_{k=1}^K (w_i)_k.$$
If we interpret $\mathbb{1} - \frac{w_i}{s(w_i)}$ as a probability distribution over labels and $s(w_i)$ as its weight, then we have $\min_h \tilde{\mathbb{E}}\big[(\mathbb{1} - \tilde\eta(x))^\top e_{h(x)}\big]$, where $\tilde P(x_i) = \frac{s(w_i)}{\sum_{i=1}^n s(w_i)}$ and $\tilde\eta(x_i) = \mathbb{1} - \frac{w_i}{s(w_i)}$.

A priori, $\max_k \frac{(w_i)_k}{s(w_i)} \le 1$, i.e., $\frac{\max_k (w_i)_k}{\sum_{k=1}^K (w_i)_k} \le \frac{1}{K - 1}$, may not hold. But, since shifting each entry of $w_i$ by the same amount does not change the initial optimization problem, we can add the constant amount $(K - 1)\max_k (w_i)_k - \sum_{k=1}^K (w_i)_k$ to each entry of $w_i$, after which $\frac{w_i}{s(w_i)} \le \mathbb{1}$.

If $\ell$ is a surrogate loss used to minimize the multiclass error, it is assumed that we can minimize $\mathbb{E}\big[(\mathbb{1} - \eta(x))^\top e_{h(x)}\big]$ by minimizing $\mathbb{E}\big[\eta(x)^\top \ell(f(x))\big]$ and taking $h = \operatorname{argmax} \circ f$. Therefore, we can solve the weighted version by minimizing the reweighted surrogate loss:
$$\min_{f \in \mathcal{F}} \tilde{\mathbb{E}}\big[\tilde\eta(x)^\top \ell(f(x))\big] \equiv \min_{f \in \mathcal{F}} \sum_{i=1}^n s(w_i) \left(\mathbb{1} - \frac{w_i}{s(w_i)}\right)^\top \ell(f(x_i)) =: \hat L(f). \quad (7)$$
This provides a convex surrogate for the original problem of minimizing the empirical Lagrangian.
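The shift-and-normalize step can be sketched as follows. This is a minimal version assuming the shift constant is $(K-1)\max_k (w_i)_k - \sum_k (w_i)_k$ and the normalizer is $s(w) = \frac{1}{K-1}\sum_k w_k$, so that $\mathbb{1} - w/s(w)$ sums to one (the specific cost vector below is hypothetical):

```python
import numpy as np

K = 3  # number of classes

def reweight(w):
    """Shift w (nonnegative costs) so max_k w_k <= s(w) := sum_k w_k / (K-1),
    then return (weight s(w), pseudo-label distribution 1 - w / s(w))."""
    shift = max(0.0, (K - 1) * w.max() - w.sum())  # constant shift keeps argmin unchanged
    w = w + shift
    s = w.sum() / (K - 1)
    eta_tilde = 1.0 - w / s
    return s, eta_tilde

s, eta = reweight(np.array([0.2, 0.5, 0.1]))
print(round(s, 3))       # 0.7
print(round(eta.sum(), 12))  # 1.0: a valid distribution over labels
print(bool(eta.min() >= 0))  # True: the shift guarantees nonnegativity
```

Because $\sum_k (1 - w_k/s(w)) = K - (K-1) = 1$ identically, the reweighted surrogate in (7) can be fed to any standard multiclass loss that accepts soft label distributions.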
Lemma D.1 (Confusion matrix generalization). Denote by $n_g$ the number of samples belonging to group $g$, for $g \in G_{\mathrm{fair}} \cup \{X\}$. Then with probability at least $1 - \delta$, for all $g \in G_{\mathrm{fair}} \cup \{X\}$,
$$\sup_{h \in \operatorname{conv} \mathcal{H}} \|C^g[h] - \hat C^g[h]\|_\infty \le \sqrt{\frac{\mathrm{VC}(\mathcal{H}) \log(n_g + 1)}{n_g}} + \sqrt{\frac{\log((1 + |G_{\mathrm{fair}}|) K^2/\delta)}{n_g}}.$$
By standard binary classification generalization bounds (Boucheron et al., 2005), with probability at least $1 - \delta'$,
$$\sup_{h \in \operatorname{conv} \mathcal{H}} \left| P(Y = i, h(X) = j \mid g) - \hat P(Y = i, h(X) = j \mid g) \right| \le \sqrt{\frac{\mathrm{VC}(\mathcal{H}) \log(n_g + 1)}{n_g}} + \sqrt{\frac{\log(1/\delta')}{n_g}}.$$
Then we take a union bound over the $1 + |G_{\mathrm{fair}}|$ confusion matrices and the $K^2$ entries per confusion matrix.
Suppose $\psi : [0,1]^{K \times K} \to [0,1]$ and $\Phi : [0,1]^{K \times K} \times ([0,1]^{K \times K})^{G_{\mathrm{fair}}} \to [0,1]^L$ are $\rho$-Lipschitz w.r.t. $\|\cdot\|_\infty$. Recall $\hat L(h, \lambda) = \hat{\mathcal{E}}(h) + \lambda^\top (\hat{\mathcal{V}}(h) - \varepsilon)$. Let $\gamma$ denote the bound in Lemma D.1 that applies to $C$, $\gamma_g$ the bound that applies to $C^g$, and denote $\gamma_{G_{\mathrm{fair}}} = \max_{g \in G_{\mathrm{fair}}} \gamma_g$. Suppose $\varepsilon \ge \rho\gamma_{G_{\mathrm{fair}}}$. Then with probability $1 - \delta$: if $(\bar h, \bar\lambda)$ is a $\nu$-saddle point of $\max_{\lambda \in [0,B]^L} \min_{h \in \operatorname{conv} \mathcal{H}} \hat L(h, \lambda)$, in the sense that $\max_{\lambda \in [0,B]^L} \hat L(\bar h, \lambda) - \min_{h \in \operatorname{conv} \mathcal{H}} \hat L(h, \bar\lambda) \le \nu$, and $h^* \in \operatorname{conv} \mathcal{H}$ satisfies $\mathcal{V}(h^*) \le 0$, then
$$\mathcal{E}(\bar h) \le \mathcal{E}(h^*) + \nu + 2\rho\gamma, \quad (8)$$
$$\|\mathcal{V}(\bar h)\|_\infty \le \frac{\nu + 1}{B} + \rho\gamma_{G_{\mathrm{fair}}} + \varepsilon. \quad (9)$$
Thus, as long as we can find an arbitrarily good saddle point (which follows from weighted ERM if $\mathcal{H}_{\mathrm{werm}}$ is expressive enough while having finite VC dimension), we obtain consistency.
By Lemma D.1, with probability $1 - \delta$,
$$|\mathcal{E}(h) - \hat{\mathcal{E}}(h)| \le \rho\gamma, \qquad \|\mathcal{V}(h) - \hat{\mathcal{V}}(h)\|_\infty \le \rho\gamma_{G_{\mathrm{fair}}}. \quad (10)$$
Therefore, $\hat{\mathcal{V}}(h^*) \le \varepsilon$. Using this feasibility to argue the first inequality below,
$$\hat{\mathcal{E}}(\bar h) - \hat{\mathcal{E}}(h^*) \le \hat{\mathcal{E}}(\bar h) - \hat L(h^*, \bar\lambda) = \hat L(\bar h, 0) - \hat L(h^*, \bar\lambda) \le \nu.$$
Then (8) follows from (10) and the triangle inequality. For the next part,
$$B\big(\hat{\mathcal{V}}(\bar h)_k - \varepsilon\big) = \hat L(\bar h, B e_k) - \hat{\mathcal{E}}(\bar h) \le \nu + \hat L(h^*, \bar\lambda) - \hat{\mathcal{E}}(\bar h) \le \nu + \hat{\mathcal{E}}(h^*) - \hat{\mathcal{E}}(\bar h) \le \nu + 1.$$
This and (10) imply (9).
E Datasets
Here we discuss the datasets used and additional experimental details.
Communities and Crime: contains neighborhoods featurized by various statistics pertaining to the neighborhoods, e.g., percent employed in various professions, demographics, rent, etc. The label is whether there is a high (above a fixed percentile) rate of violent crimes per capita. There are n = 1994 samples and N = 12 protected attributes comprising various racial statistics.

Adult census: contains census data for n = 2020 individuals. The label is whether an individual has high income. There are N = 7 protected attributes comprising age, sex, and different races.

German credit: (Dua and Graff, 2017) contains features such as financial holdings, occupation, housing, and reason for purchases, and the goal is to predict whether an individual has good credit. Several categorical variables were converted to one-hot encodings. There are n = 1000 examples and N = 3 protected attributes corresponding to age, sex, and foreign worker status.

Law school: contains n = 1823 students and their GPAs, cluster, and LSAT score. The goal is to predict whether the student passes the bar, and the protected attributes are age, gender, and family income.
For the constraint level $\nu$ we vary according to a logarithmically spaced grid, ending at 1, with 20 points. We set $B = 50$ for the GroupFair methods. We vary the regularization parameter $\rho$ across a logarithmically spaced grid with 20 points, with endpoints proportional to $1/M$.

The authors of Kearns et al. (2018) apply fictitious play to the gerrymandering problem, searching for the most violated constraint $\max_{g \in G_{\mathrm{fair}}} \frac{n_g}{n}\big|C^g_{0,1} + C^g_{1,1} - C_{0,1} - C_{1,1}\big|$ in response to the average of the predictors computed so far (if the violation exceeds $\nu$), and computing the minimizing predictor in response to the average of the dual variables obtained from the most violated constraints so far. On the other hand, we directly apply our GroupFair framework to their original cost function (see Kearns et al. (2018)), i.e., the problem of maximizing accuracy subject to $\forall g \in G_{\mathrm{fair}},\ \frac{|g|}{n}\big|C^g_{0,1} + C^g_{1,1} - C_{0,1} - C_{1,1}\big| \le \nu$. Both approaches aim to solve this problem.

Here are the full (training in addition to test) plots for the independent and gerrymandering experiments.

Figure 3: Experiments on independent group fairness. The Pareto frontier closest to the bottom left represents the best fairness/performance tradeoff.