Group Fairness: Independence Revisited
Tim Räz
Institute of Philosophy, University of Bern, Switzerland
Institute of Biomedical Ethics and History of Medicine, University of Zürich
[email protected]
ABSTRACT
This paper critically examines arguments against independence, a measure of group fairness also known as statistical parity and as demographic parity. In recent discussions of fairness in computer science, some have maintained that independence is not a suitable measure of group fairness. This position is at least partially based on two influential papers (Dwork et al., 2012; Hardt et al., 2016) that provide arguments against independence. We revisit these arguments, and we find that the case against independence is rather weak. We also give arguments in favor of independence, showing that it plays a distinctive role in considerations of fairness. Finally, we discuss how to balance different fairness considerations.
CCS CONCEPTS
• Social and professional topics → User characteristics; • Computing methodologies → Machine learning; • Applied computing → Arts and humanities.

KEYWORDS
fairness, independence, statistical parity, demographic parity, sufficiency, separation, affirmative action, accuracy
ACM Reference Format:
Tim Räz. 2021. Group Fairness: Independence Revisited. In
ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), March 1–10, 2021, Virtual Event, Canada.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3442188.3445876
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

FAccT ’21, March 1–10, 2021, Virtual Event, Canada
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8309-7/21/03...$15.00
https://doi.org/10.1145/3442188.3445876

Measures of group fairness have become an important topic in computer science after the publication of the ProPublica article “Machine Bias” [1]. ProPublica found that the risk assessment tool COMPAS is biased against black people in having unbalanced false positive and false negative rates. This is intuitively unfair. The ensuing debate mostly focused on the contrast between the measure implicitly used by ProPublica, now known as separation, and other measures, in particular a measure known as sufficiency. However, a third measure of group fairness, independence, also known as statistical parity or demographic parity, has been viewed more critically. Some computer scientists seem to think that independence is not a suitable measure of group fairness [3, 12]; others maintain that while independence is adequate in some contexts, it leads to undesirable consequences in others [4, 17]. The critical stance of computer scientists with respect to independence appears to be at least partially based on two influential papers [7, 8] that provide arguments against independence. Here we revisit and critically examine these arguments, and we find that the case against independence, as opposed to other measures of group fairness, is rather weak.

We first introduce measures of group fairness and their most important properties (section 2). In particular, we introduce the concept of conservative fairness measures, which allows us to clarify the relation between fairness and accuracy. We then examine the arguments against independence (section 3). We find, first, that arguments against independence proposed in [7] equally apply to other measures of group fairness such as sufficiency and separation, and should therefore not be taken to apply to independence specifically. Second, we argue that arguments against independence proposed in [8] are flawed in making unwarranted assumptions about conservative fairness measures such as sufficiency and separation. We prove that sufficiency and separation are not incrementally conservative, which means that these measures are not necessarily preserved if we increase the accuracy of a predictor. We then state arguments in favor of independence (section 4), finding that independence captures aspects of fairness not covered by sufficiency and separation.
Finally, we discuss how to balance different fairness considerations (section 5).

This section introduces and discusses the most important measures of group fairness, formulates these measures for the case of binary variables, and discusses other relevant fairness measures, setting the stage for the discussion in later sections.
Here we state the most important group fairness measures, following the discussion in [2]. These measures are formulated using random variables 𝑌, 𝑅, 𝐴; all measures we consider correspond to statistical properties of these variables. The variables have the following interpretation: 𝑌 is the “true label”, i.e., the characteristic that we want to predict; 𝑅 is the prediction, which can be the output of an algorithm; 𝐴 is the characteristic indicating group membership, i.e., the property with respect to which we investigate fairness. In the context of supervised learning, we have access to 𝑌 through labeled data. We will mostly focus on binary variables. Also, we will assume that a prediction 𝑅 leads to a corresponding decision.

Let us illustrate this setup using the example of college admissions. Students of different genders apply for college; a prediction about their suitability is made based on the application documents. In this case, the value of 𝑌 corresponds to the actual suitability of a student applying for college. 𝑌 is known if the data in question is historical, and the value of 𝑌 can be determined based on whether a student actually obtained a degree or not (or a different operationalization of ‘suitable applicant’). 𝑅 corresponds to the prediction whether or not a student should be admitted to college based on the application documents. 𝐴 corresponds to the gender of applicants, which we assume to be binary for simplicity’s sake.

To formulate fairness measures, we will use the following notation: Two random variables 𝑋, 𝑌 are independent if 𝑃(𝑋, 𝑌) = 𝑃(𝑋) · 𝑃(𝑌); we will write this as 𝑋 ⊥ 𝑌. Two random variables 𝑋, 𝑌 are conditionally independent given 𝑍 if 𝑃(𝑋 | 𝑌, 𝑍) = 𝑃(𝑋 | 𝑍); we will write this as 𝑋 ⊥ 𝑌 | 𝑍.

Definition 1. The measure of independence is satisfied if 𝑅 ⊥ 𝐴.

Independence, also known as statistical parity and demographic parity, means that the prediction 𝑅 does not depend on 𝐴. If independence is satisfied, a prediction is statistically balanced between different groups, in that members of the different groups get predictions at the same rate. In the case of college admissions, this means that an equal proportion of men and women applying for college are predicted to be suitable applicants.

Definition 2. The measure of sufficiency is satisfied if 𝑌 ⊥ 𝐴 | 𝑅.

Sufficiency means that, given the prediction, the true label is independent of the group. The idea is that the prediction 𝑅 contains all the information about the true label, so the sensitive characteristic is not needed; in other words, the prediction 𝑅 is sufficient for 𝑌. In the case of college admissions, this means that an equal proportion of men and women predicted to be suitable applicants are actually suitable applicants.

Definition 3. The measure of separation is satisfied if 𝑅 ⊥ 𝐴 | 𝑌.

Separation means that, given the true label, the prediction is independent of the group. The idea is that the prediction 𝑅 can only vary with respect to different groups 𝐴 insofar as this is justified by the true label 𝑌; see [2]. In the case of college admissions, this means that an equal proportion of suitable men and women applying for admission are predicted to be suitable applicants.

In this section, we discuss some important properties of the fairness measures introduced above. The discussion follows [2]; see the appendix for proofs. First, the accuracy of a predictor 𝑅 is the degree to which it agrees with the true label 𝑌; a perfect predictor is a predictor that completely agrees with the true label, i.e., 𝑌 = 𝑅. (Sufficiency is closely related to calibration. Calibration means that the predicted score reflects the true score. Calibration and sufficiency are equivalent up to reparametrization, cf. [2, p. 52].) Next, we state an important property that is shared by sufficiency and separation, but not by independence.

Proposition 4.
If we have a perfect predictor, then sufficiency and separation hold.

Independence, 𝑅 ⊥ 𝐴, is not, in general, compatible with a perfect predictor: independence and perfect predictors are only compatible if the true label is evenly distributed between groups, i.e., if we have 𝑌 ⊥ 𝐴, which is not the case in general. Proposition 4 motivates a definition that will be important in the following. It is a distinction between different kinds of group fairness measures:

Definition 5. A fairness measure is conservative if the measure is necessarily satisfied in the case of a perfect predictor. Otherwise, a fairness measure is non-conservative.

Fairness measures are called conservative because, in the case of a perfect predictor, they do not force us to change anything to obtain fairness, i.e., they conserve the status quo. Proposition 4 shows that both sufficiency and separation are conservative fairness measures; meanwhile, independence is not. Note that the reverse implication of Proposition 4 is false. The following proposition provides a characterization of when sufficiency and separation hold in some cases of non-perfect predictors:
Proposition 6.
If the joint distribution of (𝐴, 𝑌, 𝑅) is positive for all values, then sufficiency and separation hold at the same time iff 𝐴 is independent of the joint distribution of 𝑌 and 𝑅, i.e., if 𝐴 ⊥ (𝑌, 𝑅).

This proposition is important because it tells us when sufficiency and separation hold under reasonable circumstances such as a non-vanishing joint distribution. The notion of a conservative fairness measure is very strong and of limited practical relevance, because predictors are hardly ever perfect in practice. To overcome this limitation, we can define a broader notion of conservativeness and investigate if relevant fairness measures are conservative on this notion:

Definition 7. A fairness measure is incrementally conservative if the degree to which the measure is satisfied does not decrease if we increase the accuracy of the predictor.

Are the fairness measures we considered above conservative according to this broader notion? Unfortunately, this is not the case. In the appendix, the following proposition is proved:
Proposition 8.
Sufficiency and separation are not incrementally conservative fairness measures.

This means that if these two measures are satisfied by a certain (non-perfect) predictor, and we increase the accuracy of that predictor, it can happen that the improved predictor no longer satisfies the two measures. Thus, the property of conservativeness does not imply incremental conservativeness; it is not necessarily the case that a conservative fairness measure is preserved if we increase the accuracy of a predictor. (The notion of a conservative fairness measure used here is related to the concept of conservative justice [14, Sec. 2.1] insofar as the latter notion concerns the preservation of (factual) practices; however, the notion of conservativeness proposed here does not concern the preservation of norms as required by conservative justice.)
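The phenomenon behind Proposition 8 can be illustrated with a minimal numerical sketch. The counts below are my own illustrative choices, not the paper's appendix proof: a predictor that satisfies separation for two groups loses it after a strict improvement in accuracy.

```python
from fractions import Fraction as F  # exact arithmetic, no float noise

def rates(tp, fp, fn, tn):
    """Return (accuracy, FPR, FNR) for one group's counts."""
    n = tp + fp + fn + tn
    return F(tp + tn, n), F(fp, fp + tn), F(fn, tp + fn)

# Group p: tp=2, fp=1, fn=2, tn=1  ->  FPR = 1/2, FNR = 1/2
# Group q: tp=1, fp=2, fn=1, tn=2  ->  FPR = 1/2, FNR = 1/2
acc_p, fpr_p, fnr_p = rates(2, 1, 2, 1)
acc_q, fpr_q, fnr_q = rates(1, 2, 1, 2)
assert fpr_p == fpr_q and fnr_p == fnr_q   # separation holds

# Improve the predictor: one false negative in group p becomes a
# true positive (tp 2 -> 3, fn 2 -> 1). Accuracy strictly increases.
acc_p2, fpr_p2, fnr_p2 = rates(3, 1, 1, 1)
assert acc_p2 > acc_p
# But now FNR_p = 1/4 while FNR_q = 1/2: separation is lost.
assert fnr_p2 != fnr_q
```

The improvement turns one error into a correct prediction, yet the group-wise false negative rates come apart; nothing about higher accuracy protects the equality of error rates across groups.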
In this section, we discuss group fairness measures in the case of binary prediction 𝑅, ground truth 𝑌, and characteristic 𝐴, using so-called confusion matrices. Confusion matrices make it easier to formulate and reason about these measures in concrete applications. The discussion in this section draws on more thorough expositions of confusion matrices and their characteristics in [3, 13].

Assume we have collected statistical information for binary 𝑌 and 𝑅, for example, historical records of college success 𝑌, as well as the binary prediction for admission 𝑅. The prediction 𝑅 can be positive or negative, and for either outcome, it can match the true label 𝑌 (𝑎 = true positives, 𝑑 = true negatives) or not (𝑏 = false positives, 𝑐 = false negatives):

              truth (Y)
              +        –       total
pred. (R) +   𝑎        𝑏       𝑎 + 𝑏
          –   𝑐        𝑑       𝑐 + 𝑑
total         𝑎 + 𝑐    𝑏 + 𝑑   𝑁

On this basis, we can define some important statistics of confusion matrices:

• Accuracy: (𝑎 + 𝑑)/𝑁
• Positive Predictive Value (PPV): 𝑎/(𝑎 + 𝑏)
• Negative Predictive Value (NPV): 𝑑/(𝑐 + 𝑑)
• False Positive Rate (FPR): 𝑏/(𝑏 + 𝑑)
• False Negative Rate (FNR): 𝑐/(𝑎 + 𝑐)

From here on, we assume that we have observed sufficiently many cases such that the observations (relative frequencies) in our tables approximately match the “true probabilities”. Now, in order to formulate fairness measures for confusion matrices, we need one matrix for each of two groups 𝑝, 𝑞 (values of the random variable 𝐴):

Table 1: Group A = 𝑝
              truth (Y)
              +    –
pred. (R) +   𝑎    𝑏
          –   𝑐    𝑑

Table 2: Group A = 𝑞
              truth (Y)
              +    –
pred. (R) +   𝑎′   𝑏′
          –   𝑐′   𝑑′
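The statistics above can be bundled into a small helper function. This is a hypothetical sketch, not code from the paper; exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction as F

def confusion_stats(a, b, c, d):
    """Statistics of a 2x2 confusion matrix with entries
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    return {
        "accuracy": F(a + d, n),
        "PPV": F(a, a + b),
        "NPV": F(d, c + d),
        "FPR": F(b, b + d),
        "FNR": F(c, a + c),
    }

stats = confusion_stats(a=40, b=10, c=20, d=30)
# accuracy = 7/10, PPV = 4/5, NPV = 3/5, FPR = 1/4, FNR = 1/3
```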
Based on this, the three group fairness measures defined in the previous section can be formulated in terms of statistics of these two confusion matrices:
Proposition 9.
For binary variables 𝑌, 𝑅, 𝐴, independence is equivalent to (𝑎 + 𝑏)/𝑁 = (𝑎′ + 𝑏′)/𝑁′.

Proposition 10.
For binary variables 𝑌, 𝑅, 𝐴, sufficiency holds iff both groups have the same positive predictive value (PPV), i.e., 𝑎/(𝑎 + 𝑏) = 𝑎′/(𝑎′ + 𝑏′), and the same negative predictive value (NPV), i.e., 𝑑/(𝑐 + 𝑑) = 𝑑′/(𝑐′ + 𝑑′).

Proposition 11.
For binary variables 𝑌, 𝑅, 𝐴, separation holds iff both groups have the same false positive rate (FPR), i.e., 𝑏/(𝑏 + 𝑑) = 𝑏′/(𝑏′ + 𝑑′), and the same false negative rate (FNR), i.e., 𝑐/(𝑎 + 𝑐) = 𝑐′/(𝑎′ + 𝑐′).

In this section, we discuss kinds of fairness that are not group fairness measures, but that will play an important role in our discussion below. The group fairness measures introduced above are observational measures, i.e., they can be measured based on data that are typically available: In many cases, we have access to labeled data (𝑌), a predictive model (𝑅), and labels or a different kind of access to the sensitive characteristic of individuals (𝐴). However, there are other kinds of fairness considerations that are not measurable in terms of these quantities.

Kamishima et al. [10] propose to distinguish three different kinds of fairness. The first kind, prejudice, subsumes the notions of group fairness discussed above. The second kind, underestimation, is due to the fact that a model may be unfair due to the finiteness of training data. The definition of the third kind, negative legacy, is particularly important:

Definition 12.
Negative legacy is unfairness due to unfair sampling or labeling in the training data.

Kamishima et al. provide the following example of negative legacy: “[I]f a bank has been unfairly rejecting the loans of the people who should have been approved, the labels in the training data would become unfair. This problem is serious because it is hard to detect and correct” (ibid., p. 646). Kamishima et al. note that the problem can be overcome to a certain extent if an independent set of fairly labeled training data is available.

A further notion of fairness that is relevant is individual fairness, which can be defined as follows:
Definition 13. (Informal)
Individual fairness is the requirement that a fair predictor should treat similar individuals similarly, i.e., their predictions should be similar.

Individual fairness is in tension with group fairness measures under certain conditions because group fairness defines fairness in terms of (average) properties of group members, which usually does not do justice to some individual properties of group members. In particular, if the comparison of individuals uses fine-grained information such as a score or utility, it is possible to violate individual fairness while complying with some measure of group fairness. We will see examples of this below. Finally, note that individual fairness seems to be the fairness measure that is most closely related to the philosophical concept of justice, cf. [14, Sec. 1.1].
(The notion of individual fairness is due to [7]. Formally, we can make individual fairness precise by replacing the informal notion of “similarity” with two metrics, which capture how similar or close individuals and their predictions are. To do this, we need a metric 𝑑 between individuals 𝑥, 𝑦 ∈ 𝐼, and a metric 𝐷 between distributions of predictions 𝑀𝑥, 𝑀𝑦 of individuals, where 𝑀 is a map from individuals 𝐼 to distributions of predictions. To enforce individual fairness, we now require that the distance between individuals should limit the distance between the distributions of predictions, i.e., we should have 𝐷(𝑀𝑥, 𝑀𝑦) ≤ 𝑑(𝑥, 𝑦) for 𝑥, 𝑦 ∈ 𝐼. This is a so-called Lipschitz condition.)

In this section, we examine arguments against independence from the computer science literature. When following the debate, one can get the impression that independence is somehow flawed or unsuitable as a measure for group fairness. The goal of this section is to revisit and critically examine important arguments against independence.
The most important paper credited with showing that independence is not a suitable fairness concept is [7]. We will reexamine the arguments in this paper and argue, first, that it is not clear whether Dwork et al. wish to reject independence, and second, that the arguments made by Dwork et al. should not be construed as arguments against independence, but more broadly as arguments against group fairness in general.

First, let us examine whether Dwork et al. wish to simply reject independence. Note that Dwork et al. call independence “statistical parity”. In the introduction, Dwork et al. write: “we demonstrate [the inadequacy of statistical parity] as a notion of fairness through several examples in which statistical parity is maintained, but from the point of view of an individual, the outcome is blatantly unfair.” (p. 2) This sounds like an outright rejection. However, later in the paper, Dwork et al. write that “statistical parity is insufficient as a general notion of fairness.” (p. 7) This suggests that Dwork et al. merely want to argue that independence (statistical parity) is not a logically sufficient condition for fairness, which is a much weaker claim than the claim that it should be rejected tout court. What is more, the paper investigates to what extent individual fairness implies, or helps satisfy, independence, which would be unnecessary if independence should be outright rejected. Thus, Dwork et al. merely caution against independence as the sole arbiter of fairness.

Now let us examine the arguments against independence in [7, Sec. 3.1] more closely. The arguments take the form of three examples, in which the adoption of independence has undesirable consequences, in that independence holds, but individuals are treated unfairly, that is, individual fairness is violated. The first example, ‘Reduced Utility’, shows that independence does not ensure that the most suitable candidates from different groups are selected.
In the example, an organization hires people from two groups 𝑝, 𝑞. It is possible for the organization to comply with independence while, out of ignorance, choosing the least qualified members of group 𝑝 and the best qualified members of group 𝑞. This reduces the utility of the organization, and it also violates individual fairness, because similar members of the two groups are treated differently. To make it concrete, assume we have two individuals 𝑥 ∈ 𝑝 and 𝑦 ∈ 𝑞, both similarly qualified, and while 𝑦 is hired, 𝑥 is not hired; thus two individuals who are similar are not treated similarly. The second example, ‘Self-fulfilling Prophecy’, has the same structure as the first example, but the unqualified members of 𝑝 are now maliciously chosen for the purpose of justifying future discrimination against members of 𝑝. The third example, ‘Subset Targeting’, is based on the fact that independence does not ensure a fair choice within groups, in that it does not require that the most deserving members of a group get to see a relevant job ad. This implies, once more, a violation of individual fairness.

Are these examples by Dwork et al. sufficient to reject independence as a criterion of group fairness? We grant that independence can decrease utility, if utility depends on the degree of accuracy, because independence does not depend on 𝑌, and thus also not on accuracy, i.e., how well 𝑌 and 𝑅 match. The three examples discussed by Dwork et al. are all based on the fact that independence only requires that an equal proportion of two groups get classified in a certain way, but does not further specify how individuals within these groups have to be distributed with respect to 𝑌. As a consequence, independence does not, in general, guarantee individual fairness. We thus grant that these examples are valid in substance.
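The ‘Reduced Utility’ scenario can be sketched in a few lines. This is a toy example with scores of my own choosing, not data from Dwork et al.: a selection rule satisfies independence exactly while treating two equally qualified individuals from different groups differently.

```python
# Each applicant: (group, qualification score).
applicants = [
    ("p", 0.9), ("p", 0.8), ("p", 0.3), ("p", 0.2),
    ("q", 0.9), ("q", 0.8), ("q", 0.3), ("q", 0.2),
]

# Admit 50% of each group -- independence holds by construction --
# but, out of ignorance, take the *least* qualified half of group p.
admitted = {("p", 0.3), ("p", 0.2), ("q", 0.9), ("q", 0.8)}

def rate(group):
    """Fraction of a group's applicants that are admitted."""
    sel = sum(1 for g, s in admitted if g == group)
    tot = sum(1 for g, s in applicants if g == group)
    return sel / tot

assert rate("p") == rate("q") == 0.5   # independence satisfied

# Individual fairness violated: ("p", 0.9) and ("q", 0.9) are
# maximally similar, yet one is admitted and the other is not.
assert ("q", 0.9) in admitted and ("p", 0.9) not in admitted
```

The check on group rates is all that independence demands; the within-group selection is left entirely unconstrained, which is exactly the gap the three examples exploit.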
However, this is not sufficient to reject independence as opposed to other measures of group fairness such as separation and sufficiency, because similar arguments can be directed against these other measures. Our argumentative strategy is to make a tu quoque argument: If one accepts these arguments against independence, then one also has to accept similar arguments against sufficiency and separation.

We will now give a first example of gerrymandering [11] with separation, in which an employer manipulates statistics so as to realize an unequal treatment of groups, while maintaining separation.

Example 14.
Gerrymandering With Separation:
A malicious employer makes hiring decisions. There are two groups, 𝑝 and 𝑞, for which separation has to be enforced. The employer has a preference for people from group 𝑞. Assume that the employer has made provisional hiring decisions that satisfy separation, and a confusion matrix according to these hiring decisions has been compiled, cf. Tables 1 and 2. The confusion matrices satisfy separation, which implies that we have 𝑎/(𝑎 + 𝑐) = 𝑎′/(𝑎′ + 𝑐′). Assume further that the employer has a reservoir of 𝑧 qualified candidates from group 𝑞 that do not appear in the statistics of the confusion matrix. The employer can now hire candidates from that reservoir, as long as an appropriate proportion of qualified candidates is rejected, i.e., the employer creates a division 𝑧 = 𝑧₊ + 𝑧₋ into qualified people hired, 𝑧₊, and qualified people rejected, 𝑧₋, such that (𝑎′ + 𝑧₊)/(𝑎′ + 𝑐′ + 𝑧) = 𝑎′/(𝑎′ + 𝑐′). The new confusion matrices are unchanged except for the entries 𝑎′ → 𝑎′ + 𝑧₊ and 𝑐′ → 𝑐′ + 𝑧₋, which means that the matrices still satisfy separation, as can easily be verified. However, this hiring practice seems to be intuitively unfair towards group 𝑝 (and towards the qualified people from the reservoir 𝑧 who are rejected); it can also violate individual fairness because equally suitable candidates from group 𝑝 are not even considered.

It could be objected that this example is not analogous to the examples by Dwork et al. in that here, the employer has to keep part of the statistics “off the book”, and add it later. However, the examples by Dwork et al. also make the assumption that there is additional information not captured by the variables relevant to independence.
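The bookkeeping in Example 14 can be checked mechanically. The counts below are my own illustrative choices: adding 𝑧₊ hired and 𝑧₋ rejected qualified candidates to group 𝑞 in the ratio 𝑎′ : 𝑐′ leaves both error rates, and hence separation, intact.

```python
from fractions import Fraction as F

def error_rates(a, b, c, d):
    """(FPR, FNR) of a 2x2 confusion matrix."""
    return F(b, b + d), F(c, a + c)

# Provisional decisions satisfying separation:
p = dict(a=3, b=1, c=1, d=3)            # FPR = 1/4, FNR = 1/4
q = dict(a=3, b=2, c=1, d=6)            # FPR = 1/4, FNR = 1/4
assert error_rates(**p) == error_rates(**q)

# Reservoir of z = 4 qualified candidates from group q, split so that
# z_plus / z = a' / (a' + c') = 3/4: hire 3, reject 1.
z_plus, z_minus = 3, 1
q2 = dict(a=q["a"] + z_plus, b=q["b"], c=q["c"] + z_minus, d=q["d"])

# Separation still holds, although group q gained 3 extra hires.
assert error_rates(**p) == error_rates(**q2)
```

Only the first column of group 𝑞's matrix changes, and it changes proportionally, so both rate equalities survive the manipulation.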
It is important to note that this (malicious) hiring practice would not be possible in the case of independence, because in the example, the employer drives up the number of employees from group 𝑞 without raising the numbers in the other group, and independence enforces exact balance between groups.

Let us give a second example, with a different structure. This is an example of gerrymandering with sufficiency and separation.

Example 15.
Gerrymandering With Sufficiency and Separation:
Assume that an employer has made provisional hiring decisions and compiled two confusion matrices. Assume that the confusion matrices for 𝑝 and 𝑞 both do not have any zero entries and that the two confusion matrices have the same entries if they are normalized by 𝑁 and 𝑁′ respectively; this means that the corresponding joint distribution of (𝑌, 𝑅, 𝐴) is positive everywhere, and the joint distribution of (𝑌, 𝑅) is independent of group membership 𝐴. In this case, sufficiency and separation are both satisfied according to Proposition 6. Now, the employer wants to hurt group 𝑝. Under certain circumstances, this can be done as follows. The employer chooses a member 𝑥 of 𝑝 that is a false negative, i.e., 𝑥 should be hired, but is not predicted to be hired. If there is a member 𝑥* in group 𝑝 that is a true positive, i.e., should be hired and is predicted to be hired, and is more suitable for the job than 𝑥, the malicious employer can simply switch the predictions for 𝑥 and 𝑥*, such that 𝑥 becomes a true positive and 𝑥* becomes a false negative. The two confusion matrices are unchanged and thus still satisfy sufficiency and separation. However, individual fairness is violated, because a less qualified candidate has been chosen over a more qualified candidate, which hurts the group.

Note that this example also works without assuming (malicious) intent. The situation described arises naturally if an employer has less information about the relative suitability of people from group 𝑝; the employer would then hire a less than optimal selection of candidates from group 𝑝 and thus also not maximize utility. This is very similar to the first example by Dwork et al. We can conclude that examples of gerrymandering that are similar to those by Dwork et al. can be constructed for notions of group fairness such as separation and sufficiency.
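The mechanism of Example 15 is easy to verify with toy data of my own choosing: swapping the predictions of a true positive and a false negative within the same group leaves that group's confusion matrix, and hence every statistic derived from it, unchanged.

```python
from collections import Counter

# Individuals of group p: (true label Y, prediction R, suitability).
# x is a false negative (Y=1, R=0); x_star is a more suitable
# true positive (Y=1, R=1).
group_p = [
    (1, 1, 0.9),   # x_star: predicted to be hired, highly suitable
    (1, 0, 0.6),   # x: not predicted to be hired, suitable
    (0, 0, 0.2),
    (0, 1, 0.4),
]

def matrix(people):
    """Confusion matrix as counts of (Y, R) pairs."""
    return Counter((y, r) for y, r, _ in people)

before = matrix(group_p)

# The malicious swap: x becomes a true positive, x_star a false negative.
swapped = [(1, 0, 0.9), (1, 1, 0.6), (0, 0, 0.2), (0, 1, 0.4)]

assert matrix(swapped) == before   # matrix (and all derived
                                   # fairness statistics) unchanged
```

Because every observational group fairness measure is a function of the two matrices, no such measure can register the difference between hiring the more suitable and the less suitable candidate.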
Note that other examples of gerrymandering for sufficiency and separation can be constructed along the lines of the examples given here.

It could be asked why Dwork et al. did not appreciate that their arguments apply to other notions of group fairness as well. Here is a plausible explanation: Except for independence, notions of group fairness, such as sufficiency and separation, were only discussed more widely after the publication of the seminal ProPublica article [1], which appeared four years after the publication of [7]. Thus, Dwork et al. might have raised their objections against notions of group fairness in general, and not just against independence, if they had been aware of other notions. It should also be noted that the primary focus of Dwork et al. is on the notion of individual fairness, not on group fairness. The paper does examine the relation between individual fairness and independence (statistical parity), but this is not the center of attention. All in all, the idea that independence is somehow more problematic than other group fairness criteria, which is held in parts of the computer science literature and can at least partially be traced back to Dwork et al., may be a historical accident.

Other arguments in the computer science literature are targeted more specifically at independence and do not apply to other notions of group fairness such as sufficiency and separation. ([2] mention that arguments against independence may apply to other statistical fairness measures as well, but this is not elaborated.) In [8], the authors propose separation as a fairness measure. To differentiate separation from independence, the authors claim that independence is flawed for two reasons that do not apply to separation. The first reason is the kind of argument made in [7]. The second reason why independence is flawed is given in the following quote – note that the authors call independence “demographic parity”, and the predictor is denoted 𝑌̂:
... demographic parity often cripples the utility that we might hope to achieve. Just imagine the common scenario in which the target variable 𝑌 – whether an individual actually defaults or not – is correlated with 𝐴. Demographic parity would not allow the ideal predictor 𝑌̂ = 𝑌, which can hardly be considered discriminatory as it represents the actual outcome. As a result, the loss in utility of introducing demographic parity can be substantial. [8, p. 2]

Later, the authors note that separation does not have this problem: “Unlike demographic parity, our notion always allows for the perfectly accurate solution [...]” (ibid.) We will now reconstruct the argument in this passage based on the distinctions made in section 2. There are two different readings of the argument. The first reading focuses on the relation between accuracy and utility, while the second reading focuses on the relation between accuracy and fairness. On the first reading, the argument can be reconstructed as follows:

P1 A perfect predictor maximizes utility.
P2 Independence is a non-conservative fairness criterion (is not generally compatible with a perfect predictor), while separation is a conservative fairness criterion (is compatible with a perfect predictor).
C1 Therefore, independence is not generally compatible with maximal utility, while separation is.
C2 Therefore, separation should be preferred over independence.

There are two main problems with this argument. The first problem is premiss P1: It is not the case that accuracy and utility align necessarily; see [5]. For one, accuracy only captures the state of the world as it is at a certain point in time. Thus, if we maximize accuracy, we maximize utility only with regard to short-term goals. To take the example of risk assessment, maximizing utility means minimizing current risk. This does not take into account the value of changing risk assessment so as to minimize, say, future risk, which can be tied to, say, racial justice.
It is explicitly noted in [5] that utility according to present risk scores is “immediate utility”. Furthermore, note that if we have a predictor 𝑅 that is not perfect, and false positives and false negatives have different utilities, we may have to choose a predictor 𝑅′ that is even less accurate than 𝑅 to maximize utility.

The second problem is the step from conclusion C1 to conclusion C2. As is often pointed out in the computer science literature, we virtually never have a perfect predictor. So we are almost never in a situation where it actually matters that a fairness measure is conservative, i.e., that the measure is compatible with the perfect predictor. However, if we are almost never in this situation, conservativeness is a theoretical concern, but practically irrelevant. A situation that is practically irrelevant should not guide our choice of fairness measure. So, there is no practical reason to prefer conservative fairness measures over non-conservative ones.

It could be thought that the above argument also goes through for broader notions of conservativeness, i.e., that it holds for increments of accuracy: if we increase accuracy, and this automatically increases the degree to which a fairness measure holds, then we do not need a perfect predictor for accuracy to be of practical relevance; the two align in increments. In fact, Hardt et al. appear to have an argument along these lines in mind. Immediately after the passage quoted above, they write:

[O]ur criterion is easier to achieve the more accurate the predictor 𝑌̂ is, aligning fairness with the central goal in supervised learning of building more accurate predictors. [8, p. 2]

This claim, however, is false in view of Proposition 8, which establishes that it is possible to start with a predictor 𝑅 that satisfies separation, increase the accuracy of 𝑅, and obtain a new predictor 𝑅′ that no longer satisfies separation.
Proposition 8 shows that both separation and sufficiency are not incrementally conservative, and that, therefore, an incremental version of the above argument does not support separation or sufficiency as opposed to independence.

Let us now turn to the second reading of the argument in the quote from Hardt et al., which focuses on the relation between accuracy and fairness:

P1* A perfect predictor is (maximally) fair, because it aligns with the actual outcome.
P2* Independence is a non-conservative fairness criterion (is not generally compatible with a perfect predictor), while separation is a conservative fairness criterion (is compatible with a perfect predictor).
C1* Therefore, independence is not generally compatible with a (maximally) fair predictor, while separation is.
C2* Therefore, separation should be preferred over independence.

There are, again, two problems with this argument. The first problem, the step from the first to the second conclusion, was already discussed above – we can reasonably doubt the practical relevance of perfect predictors, because they are virtually never realized, and an incremental version of the argument is demonstrably false. The second, more fundamental problem is premiss P1*. This premiss is unsupported, and, arguably, wrong in general. Premiss P1* is problematic both from a philosophical and from a computer science perspective.

From a computer science perspective, there are important aspects of algorithmic fairness that are not captured by group fairness measures, and this is well known. Take, for example, the kinds of fairness discussed in [10]; see section 2.4 above. Negative legacy is unfairness due to unfair sampling or labeling. Consider the case of unfair labeling. Unfair labeling means that the distribution 𝑃(𝑌, 𝐴) is unfair, i.e., the distribution of actual outcomes 𝑌 we measure at a certain point in time favors one of the groups in 𝐴 over another in a way we consider to be unfair.
What premiss P1* says is that a perfect predictor 𝑅 is fair because it aligns with the actual outcome, i.e., because we have 𝑌 = 𝑅. (Note that from a conceptual or philosophical point of view, it could be worthwhile to explore the case of perfect predictors. The argument made here takes the more practical position of computer science that perfect predictors are negligible as a point of departure.) However, 𝑌 = 𝑅 only provides a good justification of the fairness of 𝑅 with respect to 𝐴, i.e., of 𝑃(𝑅, 𝐴), if the distribution 𝑃(𝑌, 𝐴) itself is fair, which need not be the case if labeling is unfair; this is what Kamishima et al. point out. The distribution 𝑃(𝑌, 𝐴) can arise through unfair practices, historical biases, and so on.

Importantly, Kamishima et al. also point out that this sort of unfairness is hard to detect or measure if we do not have access to a sample with fair labeling, such that we can obtain a fair estimate of 𝑃(𝑌, 𝐴). But of course, just because it can be hard, or even impossible, to quantify negative legacy does not mean that this quantity is of no ethical import. Fairness is completely independent of our ability to measure it.

Let us illustrate these points with some examples. Why should we think that an accurate predictor is fair? One of the reasons may be that an accurate predictor aligns with the ground truth 𝑌. And trying to align predictions with the truth should not be considered to be discriminatory – this is the point made by Hardt et al. in the above quote. To address this point, recall what truth means in the present context: It means that 𝑌 captures what we observe in the world at a certain point. For example, we observe that people from group 𝑝 in fact get arrested more frequently than people from group 𝑞, we observe that group 𝑝 in fact has more loan applications rejected than group 𝑞, and so on. This is what the joint distribution of 𝑌 and 𝐴 captures. In other words, the distribution 𝑃(𝑌, 𝐴) is a picture of the status quo.
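The point that a perfect predictor simply reproduces the status quo can be illustrated with a small numerical sketch. The group sizes and base rates below are hypothetical, chosen only for illustration: the perfect predictor 𝑅 = 𝑌 makes no errors, so separation and sufficiency hold trivially, yet its acceptance rates differ across groups whenever the label distribution does, so independence is violated.

```python
# Hypothetical joint counts for (A, Y); the disparity in base rates
# is an assumed illustration, not data from the paper.
counts = {("p", 1): 60, ("p", 0): 40,   # group p: 60% positive labels
          ("q", 1): 30, ("q", 0): 70}   # group q: 30% positive labels

def acceptance_rate(group):
    """P(R = 1 | A = group) for the perfect predictor R = Y."""
    pos, neg = counts[(group, 1)], counts[(group, 0)]
    return pos / (pos + neg)

# Separation and sufficiency hold trivially (R = Y has no errors),
# but independence fails: the predictor inherits the label disparity.
gap = abs(acceptance_rate("p") - acceptance_rate("q"))
print(acceptance_rate("p"), acceptance_rate("q"))  # 0.6 0.3
print(f"independence gap: {gap:.2f}")              # independence gap: 0.30
```

Whether the resulting gap of 0.30 signals unfairness depends, as argued above, on whether the distribution 𝑃(𝑌, 𝐴) itself is fair.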
However, the world as it is at a certain point, or the status quo, is not a moral category. It is just a description of what we find in the world. It does not answer the question whether the world as we find it is fair, or morally justified. Finding the world to be a certain way, and inferring from this that the world ought to be this way, is committing a fallacy according to some philosophers, based on a confusion between facts and values; see, e.g., the discussion of the Is–Ought gap in [15, Sec. 2.1.].

At this point, it could be objected that in some cases, the distribution of labels does have moral import. Take, for example, the often-mentioned case of violent offenders. If the distribution 𝑃(𝑌, 𝐴) captures the historical record of reoffending of violent criminals in the past, then it makes sense to align our predictor 𝑅 with 𝑌. It seems that we cannot just ignore the historical record in favor of a group fairness measure such as independence. The price we pay by releasing (potentially) violent criminals from one group, or by locking up (potentially) innocent members of the other group because these groups have different frequencies with respect to 𝑌, seems very high, and the choice of such a predictor seems morally wrong. A form of this argument is made in the following passage of [3, p. 14]: "[Independence] has been criticized because it can lead to highly undesirable decisions for individuals (Dwork et al. 2012). One might incarcerate Muslims who pose no public safety risk so that the same proportions of Muslims and Christians are released on parole."

The response to this objection is that it is perfectly possible that ignoring the status quo has undesirable moral consequences, as in the case of violent offenders. However, this does not invalidate the point that the status quo in itself does not have moral status. It just means that the status quo can impact considerations of fairness in some cases, and that we may have to weigh the moral consequences of sticking to or deviating from the status quo against other considerations of fairness. We will turn to a discussion of how this could be achieved in the next section.

So far, we have examined arguments against independence, and we have found that the case against independence is not as clear cut as some of the computer science literature suggests. In this section, we turn to the case for independence. Why is independence a good or useful fairness measure? We compare independence to other notions of group fairness to highlight its usefulness, but also its limitations. Our goal is not to recapitulate the philosophical literature that supports independence. Rather, our goal is to establish some connections between philosophical concerns and the more formal discussion in computer science.

Independence is defined as 𝑅 ⊥ 𝐴, that is, probabilistic independence of group membership and prediction. Note that in practice, it makes sense to not require strict independence, but an approximate version of independence. One justification of independence is that it controls, and potentially compensates, for historical injustice. One manifestation of historical injustice is what Kamishima et al. call negative legacy [10], viz. a distribution 𝑃(𝑌, 𝐴) that we consider to be unjust. The distribution can be unjust because it does not adequately represent the true properties of the groups involved – this would correspond to unfair sampling, in which case we may not know the true distribution – or because the distribution does represent the true properties of the groups involved, but these properties themselves did not come about in a fair way – this would correspond to unfair labeling.
Formally, negative legacy can manifest as a correlation between group membership 𝐴 and ground truth 𝑌, i.e., 𝑌 ⊥̸ 𝐴: if the groups 𝐴 should have equal access to the outcome encoded by 𝑌, there should be no correlation between group membership and outcome, i.e., we should have 𝑌 ⊥ 𝐴. Note that, as in the case of independence, we can formulate an approximate version of this requirement. Now, if we build a predictor 𝑅 with a focus on accuracy, as is usually the case, we get 𝑅 ≈ 𝑌, i.e., the predictor is approximately accurate. However, this also implies that the predictor 𝑅 does not satisfy (an approximate version of) independence. Thus, independence helps us detect this form of historical injustice, and it suggests that we modify 𝑅, such that, approximately, we obtain 𝑅 ⊥ 𝐴. This modification of 𝑅 may also influence negative legacy in the long run by moving the distribution of 𝑌 closer to the desired 𝑌 ⊥ 𝐴 over time, such that accuracy and independence align naturally. This is one argument in favor of independence. (Note that above, we have excluded the first case through the assumption that the confusion matrices are at least approximately representative of the true probabilities. We have not excluded the second case.)

To better understand the usefulness of independence as a fairness measure, let us compare it to other kinds of measures. Take, first, sufficiency and separation. The main difference between independence on the one hand, and sufficiency and separation on the other, is that independence is formulated without 𝑌. This means that while sufficiency and separation track the difference between a prediction 𝑅 and the truth given by 𝑌 – they are measures of error or deviation from the truth – independence does not track deviation from the truth. Prima facie, this may seem like a deficiency of independence. However, as was just explained, independence helps us detect unfairness in the distribution of 𝑌 exactly because it does not focus on deviations from 𝑌.
It helps us to see what may be wrong with the distribution of 𝑌 itself. This is an advantage of independence in contrast to separation and sufficiency.

Now let us compare independence to affirmative action, viz., the requirement that predictions 𝑅 have to satisfy certain thresholds or quotas. In the case of college admissions, the requirement could be that a certain percentage of admitted candidates have to be members of a racial minority; see [9] for a discussion of affirmative action in the context of college admissions. A justification for affirmative action is to compensate for historical injustice. In this respect, the justification of affirmative action is similar to the justification of independence given above.

However, there are also important differences between independence and affirmative action. One difference is that independence only requires predictions to be independent of group membership. Affirmative action, on the other hand, can be more stringent in requiring that predictions satisfy certain proportions. For example, if only 10% of college applicants belong to a minority, independence would require that the admission rate for these 10% is the same as the general admission rate, while affirmative action may require that the admission rate among the 10% is larger to allow for a given balance of admitted candidates, irrespective of application rates. This means that independence, formulated for a given set of applicants, will not correct for certain kinds of biases such as underrepresentation of groups among applicants, while affirmative action may correct for this kind of bias.

More generally, it should be stressed that while independence may highlight and help to compensate for certain kinds of historical injustice, implementing it will not correct for many other forms of injustice.
In particular, independence prescribes an intervention only on the prediction 𝑅, which can be interpreted as a compensation for a certain distribution of 𝑌, and does not prescribe an intervention on the causes of this distribution, or an intervention on the effects of this distribution.

We have now seen arguments both in favor of and against independence, and we have found that there is some validity to arguments on both sides. How should we proceed from here? How should these arguments be weighted? We will not be able to answer these questions here, but we can provide some rough guidelines in view of the above discussion.

First, we should always explicitly state the moral value of either choosing or rejecting a group fairness measure such as independence, as opposed to arguing solely on the basis of factual and descriptive properties of fairness measures. We have seen why this is important in the case of independence. We have argued that accuracy in and of itself does not have moral value. We do not deny that accuracy can be morally beneficial in certain situations or contexts; however, it is these moral benefits we care about, and they should be stated. For example, if neglecting accuracy has substantial social costs in some cases, this is what we care about, and not accuracy per se. Only once the values supporting arguments for or against independence have been made explicit can we weigh them.

Second, gerrymandering is a problem shared by all measures of group fairness. It is possible to violate individual fairness while complying with sufficiency or separation, just as it is possible while complying with independence. Now, there is already a lot of work in computer science dealing with this problem, beginning with [7], who examine under which conditions independence and individual fairness can be combined.
One of the problems of combining measures of group fairness and individual fairness will be, once more, to make the moral value of either choice explicit and assign appropriate weights to these choices.

Third, it is a mistake to think that we can either require or reject measures of group fairness independently of the case to which they are applied. Rather, the importance of different group fairness measures is context dependent. We have seen examples of this above: The cost of requiring independence in the case of classifying violent offenders is different from the cost of requiring independence in the case of college admissions. In the first case, the cost of making mistakes seems high; in the second case, the cost of making mistakes seems lower both for individuals and for society; see [4].

Fourth, the preceding two points suggest that none of the group fairness measures we discussed here are logically necessary or logically sufficient for fairness: They cannot be logically sufficient because they violate individual fairness at least in some cases, and they cannot be logically necessary because they appear to be in conflict with our intuitions about fairness in other cases. This also suggests interpreting these measures of group fairness not as absolute criteria for fairness. Rather, they can be indicative of fairness or unfairness depending on the case at hand.
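The contrast between independence and a quota-style affirmative action policy, discussed above for college admissions, can be made concrete with a small sketch. All numbers (pool size, overall admission rate, and the 30% quota) are hypothetical illustrations:

```python
# Hypothetical applicant pool: 10% minority, overall admission rate 20%.
applicants = {"minority": 100, "majority": 900}
overall_rate = 0.20

# Independence: every group is admitted at the same rate, so the admitted
# class mirrors the applicant pool (10% minority).
indep = {g: round(n * overall_rate) for g, n in applicants.items()}
print(indep)  # {'minority': 20, 'majority': 180}

# A quota-style affirmative action policy (assumed here: 30% of a fixed
# class of 200 must be minority) admits minority applicants at a higher
# rate than the overall rate, which independence alone never requires.
class_size = sum(indep.values())                   # 200
quota = {"minority": round(0.30 * class_size),
         "majority": round(0.70 * class_size)}
print(quota)                                       # {'minority': 60, 'majority': 140}
print(quota["minority"] / applicants["minority"])  # 0.6, vs. 0.2 overall
```

The sketch shows why independence, formulated over a given applicant pool, cannot correct for underrepresentation among applicants, while a quota can.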
In this paper, we have examined the discussion of independence in the computer science literature, and we have found that some arguments against independence are not convincing, in that they either equally apply to other measures of group fairness, or unduly emphasize descriptive properties of fairness measures, viz. conservativeness, as opposed to normative ones. We have also made a positive case for independence, arguing that it can highlight a distinct kind of unfairness not captured by sufficiency or separation. The main upshot of the present paper is that independence is an important measure of group fairness that has to be taken into account in discussions of algorithmic fairness.
A PROOFS
Here we give proofs of the propositions in the main text. All propositions and proofs can be found in the literature [2, 6, 16] and are collected here for convenience's sake, except for the proof of proposition 8, which is new. We first state some useful properties of conditional independence (see the above references for proofs):
Proposition 16.
Properties of conditional independence:
(1) If 𝑋 ⊥ 𝑌 | 𝑍, then 𝑌 ⊥ 𝑋 | 𝑍;
(2) if 𝑋 ⊥ 𝑌 | 𝑍 and 𝑈 = ℎ(𝑋), then (i) 𝑈 ⊥ 𝑌 | 𝑍 and (ii) 𝑋 ⊥ 𝑌 | (𝑍, 𝑈);
(3) if 𝑌 = ℎ(𝑍), then 𝑋 ⊥ 𝑌 | 𝑍;
(4) 𝑋 ⊥ 𝑌 | 𝑍 and 𝑋 ⊥ 𝑊 | (𝑌, 𝑍) iff. 𝑋 ⊥ (𝑊, 𝑌) | 𝑍;
(5) if 𝑋 ⊥ 𝑌 | 𝑍, 𝑋 ⊥ 𝑍 | 𝑌, and (𝑋, 𝑌, 𝑍) is positively distributed everywhere, then 𝑋 ⊥ (𝑌, 𝑍).
Note that properties 1, 2, 3 also hold without conditioning on 𝑍.

Proof of proposition 4:
A perfect predictor means that 𝑌 = 𝑅. Sufficiency means 𝐴 ⊥ 𝑌 | 𝑅 and separation means 𝐴 ⊥ 𝑅 | 𝑌. For a perfect predictor, these reduce to 𝐴 ⊥ 𝑌 | 𝑌. By property 3 of conditional independence, this is true for a perfect predictor because 𝑌 = 𝑓(𝑌). □

Proof of proposition 6:
The direction (1) ⇒ (2) is property 5 of conditional independence. The direction (2) ⇒ (1) can be seen as follows: view (𝑌, 𝑅) as a two-dimensional random variable, note that 𝑌 and 𝑅 are functions of this random variable (projection), then the result follows from property 2 of conditional independence (without conditioning on 𝑍). □

Proof of Proposition 8 (Incremental Conservativeness):
We show that sufficiency and separation are not, in general, preserved if the accuracy of a predictor is increased, by giving an example where accuracy increases but separation and sufficiency are lost. First, consider the two following confusion matrices (recall that 𝑌 stands for the true label, while 𝑅 stands for the prediction):

        Y+    Y–    total
R+      10     2     12
R–       3    11     14
total   13    13     26

Table 3: Group A=p

        Y+    Y–    total
R+      20     4     24
R–       6    22     28
total   26    26     52

Table 4: Group A=q
These matrices satisfy sufficiency and separation; the easiest way to see this is to check that the table for 𝑞 is a multiple of the table for 𝑝, so the relative frequencies are the same, which implies that sufficiency and separation are satisfied by proposition 6. It can also be checked by hand, by using the relation between the statistics of confusion matrices on the one hand and fairness on the other, explained in section 2.3. Now we increase the accuracy of the predictor 𝑅, by taking, in each group, an element of the false negatives and shifting it to the true positives. This yields a new predictor 𝑅′ with the following confusion matrices:

        Y+    Y–    total
R'+     11     2     13
R'–      2    11     13
total   13    13     26

Table 5: Group A=p

        Y+    Y–    total
R'+     21     4     25
R'–      5    22     27
total   26    26     52

Table 6: Group A=q
Note that the predictor is more accurate in both groups. Now we check whether these tables satisfy sufficiency and separation. For sufficiency, we would need that the positive predictive values (PPV) agree, but we have:

𝑎/(𝑎 + 𝑏) = 11/13 ≠ 21/25 = 𝑎′/(𝑎′ + 𝑏′)   (1)

For separation, we would need that the false negative rates (FNR) agree, but we have:

𝑐/(𝑎 + 𝑐) = 2/13 ≠ 5/26 = 𝑐′/(𝑎′ + 𝑐′)   (2)

Thus, we have increased accuracy and lost both separation and sufficiency. This shows that separation and sufficiency are not incrementally conservative fairness measures. □

Note that if we had increased accuracy in proportion to group size, i.e., if we had shifted two elements instead of one from false negatives to true positives in group 𝑞, we would have preserved sufficiency and separation. The reason for this is that this increment would have preserved the proportions of the confusion matrices between the two groups. However, this is a very special kind of increment. The case we have discussed above, with increments not proportional to the size of the groups, is easier to realize and presumably more common.

Proof of proposition 10:
Sufficiency for groups 𝐴 = 𝑝, 𝑞 means, in the case 𝑅 = + and 𝑌 = +:

𝑃(𝑌 = + | 𝐴 = 𝑝, 𝑅 = +) = 𝑃(𝑌 = + | 𝐴 = 𝑞, 𝑅 = +) ⇔ 𝑎/(𝑎 + 𝑏) = 𝑎′/(𝑎′ + 𝑏′),

where the choice of 𝑌 = − yields an equivalent condition; the same reasoning holds for 𝑅 = −. □

Proof of proposition 11:
Similar to the proof of proposition 10. □
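The counterexample in the proof of proposition 8 can also be checked numerically. The sketch below recomputes accuracy, PPV, and FNR from the confusion matrices of Tables 3–6, with the false negative and true negative counts following from the stated column totals; exact fractions avoid floating-point rounding:

```python
from fractions import Fraction as F

def stats(tp, fp, fn, tn):
    """Accuracy, positive predictive value, and false negative rate."""
    total = tp + fp + fn + tn
    return (F(tp + tn, total),   # accuracy
            F(tp, tp + fp),      # PPV (tracked by sufficiency)
            F(fn, tp + fn))      # FNR (tracked by separation)

# Predictor R (Tables 3 and 4): group q is exactly twice group p.
p_before, q_before = stats(10, 2, 3, 11), stats(20, 4, 6, 22)
# Predictor R': one false negative shifted to the true positives
# in each group (Tables 5 and 6).
p_after, q_after = stats(11, 2, 2, 11), stats(21, 4, 5, 22)

# Before: sufficiency and separation hold (equal PPV and FNR).
assert p_before[1:] == q_before[1:]
# Accuracy increases in both groups ...
assert p_after[0] > p_before[0] and q_after[0] > q_before[0]
# ... but both criteria are lost: 11/13 != 21/25 and 2/13 != 5/26.
assert p_after[1] != q_after[1] and p_after[2] != q_after[2]
```

Shifting two elements in group 𝑞 instead of one (i.e., `stats(22, 4, 4, 22)`) restores the proportionality and makes all assertions of equality hold again, matching the remark after the proof.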
ACKNOWLEDGMENTS
I thank Michele Loi, Corinna Herweck, and members of the philosophy of science research colloquium in the Fall of 2020 at the University of Bern for helpful comments on an earlier draft of the paper. This work is supported by the National Research Programme "Digital Transformation" (NRP 77) of the Swiss National Science Foundation (SNSF) under Grant No. 187473.
REFERENCES
[1] Angwin, J., J. Larson, S. Mattu, and L. Kirchner. 2016. Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica.
[2] Barocas, S., M. Hardt, and A. Narayanan. 2019. Fairness and Machine Learning. fairmlbook.org.
[3] Berk, R., H. Heidari, S. Jabbari, M. Kearns, and A. Roth. 2018. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research.
[4] Chouldechova, A. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. ArXiv:1703.00056v1.
[5] Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq. 2017. Algorithmic Decision Making and the Cost of Fairness. KDD '17: 797–806.
[6] Dawid, A. P. 1979. Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society. Series B (Methodological).
[7] Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. 2012. Fairness through Awareness. ITCS '12.
[8] Hardt, M., E. Price, and N. Srebro. 2016. Equality of Opportunity in Supervised Learning. Advances in Neural Information Processing Systems 29.
[9] Designing Affirmative Action Policies under Uncertainty. Master's thesis, University of Helsinki.
[10] Kamishima, T., S. Akaho, and J. Sakuma. 2011. Fairness-aware Learning through Regularization Approach. 2011 IEEE 11th International Conference on Data Mining Workshops.
[11] Kearns, M., S. Neel, A. Roth, and Z. S. Wu. 2018. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. PMLR 80: 2564–2572.
[12] Kleinberg, J. M., S. Mullainathan, and M. Raghavan. 2016. Inherent Trade-Offs in the Fair Determination of Risk Scores. CoRR abs/1609.05807.
[13] Loi, M., A. Herlitz, and H. Heidari. 2019. A Philosophical Theory of Fairness for Prediction-Based Decisions. http://dx.doi.org/10.2139/ssrn.3450300.
[14] Miller, D. 2017. Justice. The Stanford Encyclopedia of Philosophy.
[15] Väyrynen, P. 2019. Thick Ethical Concepts. The Stanford Encyclopedia of Philosophy.
[16] Wasserman, L. 2004. All of Statistics. Springer Texts in Statistics. New York: Springer.
[17] Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. 2013. Learning fair representations. ICML '13, PMLR 28.