A Ranking Approach to Fair Classification
Jakob Schoeffer, Niklas Kuehl, and Isabel Valera
Karlsruhe Institute of Technology (KIT), Germany; Saarland University, Saarbrücken, Germany
{jakob.schoeffer,niklas.kuehl}@kit.edu, [email protected]

ABSTRACT
Algorithmic decision systems are increasingly used in areas such as hiring, school admission, or loan approval. Typically, these systems rely on labeled data for training a classification model. However, in many scenarios, ground-truth labels are unavailable, and instead we only have access to imperfect labels as the result of (potentially biased) human-made decisions. Despite being imperfect, historical decisions often contain some useful information on the unobserved true labels. In this paper, we focus on scenarios where only imperfect labels are available and propose a new fair ranking-based decision system, as an alternative to traditional classification algorithms. Our approach is both intuitive and easy to implement, and thus particularly suitable for adoption in real-world settings. In more detail, we introduce a distance-based decision criterion, which incorporates useful information from historical decisions and accounts for unwanted correlation between protected and legitimate features. Through extensive experiments on synthetic and real-world data, we show that our method is fair, as it a) assigns the desirable outcome to the most qualified individuals, and b) removes the effect of stereotypes in decision-making, thereby outperforming traditional classification algorithms. Additionally, we are able to show theoretically that our method is consistent with a prominent concept of individual fairness which states that “similar individuals should be treated similarly.”
Algorithmic decision systems have been increasingly used for decision support in recent years. A common perception is that algorithms can avoid human bias and make more objective and transparent decisions [Castelluccia and Le Métayer, 2019]. However, as algorithms support humans with ever more consequential decisions, they have also become subject to enhanced scrutiny. In 2016, journalists at ProPublica found that COMPAS, a system used by US courts to assess defendants’ risk of recidivism, was unfair towards black people [Angwin et al., 2016]. In November 2019, Bloomberg reported that Steve Wozniak suspected the algorithm that determines credit limits for Apple’s credit card of discriminating against women [Nasiripour and Natarajan, 2019]. These and other examples make obvious the need for understanding root causes and developing techniques to combat algorithmic unfairness. In large part, prior work has focused on formalizing the concept of fairness and enforcing certain statistical equity constraints when making predictions—mostly in a setting of binary classification, for instance when it must be decided whether a loan should be offered or not.

However, traditional classification algorithms require access to actual ground-truth labels, which are often unavailable [Lakkaraju et al., 2017]. In practice, we may only have access to imperfect labels, generally as the result of (potentially biased) historical human-made decisions. Inspired by the argumentation of Kilbertus et al. [2020], we propose to not learn to predict imperfect labels. Instead, we introduce a fair decision criterion based on an observation’s distance to what we call the
North Star—a (potentially hypothetical) observation that is most qualified in a given scenario. Our approach induces both a ranking and an opportunity to classify observations. We also put forward ideas to a) incorporate useful information from historical decisions (Section 3.2) and b) reduce the importance of features that are highly correlated with protected attributes (e.g., gender) in the decision-making process (Section 3.3). The rest of the paper is structured as follows: In Section 2, we introduce important concepts and related work. Section 3 represents the core of our work—the methodology as well as theoretical results. In Section 4, we illustrate our method by the example of the German Credit data set, and we conduct extensive experiments on synthetic data in Section 5. Section 6 concludes our work.
To lay out the foundations, we briefly introduce important concepts related to our proposed methodology. We start with a summary of the notation used in this paper. We call A the set of protected features which must not be discriminated against, and let a_k ∈ A, k ∈ {1, …, K}, be the individual protected features. The US Equality Act [116th Congress, 2019], for instance, defines sex, gender identity, and sexual orientation, among others, as protected features. In line with other related work, we assume that the decision whether a feature is protected or not is made externally [Žliobaitė, 2015]. We further define x_ℓ ∈ X, ℓ ∈ {1, …, L}, as the non-protected (or legitimate) features, and Y as the set of imperfect labels, which can be either positive (+) or negative (−). We call these labels imperfect as they are only a noisy signal of the true labels. A given data set consisting of A, X, Y is referred to as D. Lastly, we call Ŷ the predictor, a function that maps observations to positive or negative outcomes. When referring to an individual observation, we use superscripts like A^{(i)}, which would be the set of protected features for observation (i), while we have N observations in total.

Mehrabi et al. [2019] define fairness in the context of decision-making as the “absence of any prejudice or favoritism towards an individual or a group based on their intrinsic or acquired traits.” Generally, existing literature distinguishes individual from group fairness definitions. In this work, we are primarily concerned with individual fairness—however, we also make group fairness-related arguments in the spirit of demographic parity [Zafar et al., 2017a] later on. A typical approach aiming at individual fairness is fairness through awareness (FTA) [Dwork et al., 2012]. We will briefly introduce this, as well as the conception of fairness through unawareness (FTU) [Grgic-Hlaca et al., 2016], due to their importance for our work.
Definition 1 (Fairness through unawareness). An algorithm is fair so long as any protected features are not explicitly used in the decision-making process.
FTU simply requires a predictor Ŷ to ignore all protected features in A. However, as Hardt et al. [2016] argue, this definition is ineffective in the presence of strong correlation between protected and legitimate features. In the work at hand, we address this issue by penalizing highly correlated features with respect to their importance for decision-making.

Definition 2 (Fairness through awareness). An algorithm is fair if it gives similar predictions to similar individuals.
Formally, FTA requires an appropriate distance metric d(·, ·). If, for two individuals (i) and (j), d(i, j) is small, then FTA requires that Ŷ(i) ≈ Ŷ(j). As stated by Dwork et al. [2012], the main challenge with this notion is defining an appropriate distance metric. In most cases, this requires domain-specific knowledge.

Most existing work on algorithmic fairness has been concerned with fair classification. Herein, numerous articles have been published on how to formally define [Grgic-Hlaca et al., 2016, Hardt et al., 2016, Dwork et al., 2012, Kusner et al., 2017, Corbett-Davies et al., 2017, Pedreshi et al., 2008] and enforce [Kilbertus et al., 2020, Zafar et al., 2017a,b, Hardt et al., 2016, Kamiran and Calders, 2009, 2012, Kamishima et al., 2012, Calmon et al., 2017] fairness. Generally, fairness-aware techniques can be divided into three categories which are based on the process step of the application: The first category is concerned with removing existing bias from the training data (pre-processing). Typical approaches involve transformation of the data [Calmon et al., 2017] or changing the class labels for training [Kamiran and Calders, 2012]. The second category involves modification of existing algorithms (in-processing), typically through adding fairness constraints [Zafar et al., 2017b] or penalizing discrimination, for instance by means of regularization [Kamishima et al., 2012]. The third category includes all techniques aimed at changing the output of a potentially unfair model (post-processing). Hardt et al. [2016], for instance, construct a non-discriminating predictor from an existing one via solving an optimization problem. However, if ground-truth labels are not (or selectively) available [Lakkaraju et al., 2017], then maximizing for prediction accuracy seems counterintuitive and is, in fact, suboptimal [Kilbertus et al., 2020].

More recently, alternative concepts based on the theory of causal inference have evolved [Kilbertus et al., 2017, Kusner et al., 2017]. While these approaches have shown promising results, they generally make strong assumptions about the causal structure of the world. Additional shortcomings and misconceptions about causal models in the realm of algorithmic fairness are discussed, for instance, by Hu and Kohler-Hausmann [2020].
Fair ranking approaches can be split into pre-processing [Yang and Stoyanovich, 2017], in-processing [Zehlike and Castillo, 2020], and post-processing [Biega et al., 2018, Celis et al., 2018, Singh and Joachims, 2017, 2018, Zehlike et al., 2017] techniques as well [Castillo, 2019]. With respect to quantifying fairness, most existing methods apply an attention-based criterion, aiming at equalizing exposure of observations in, for instance, (web) searches [Biega et al., 2018, Singh and Joachims, 2017, 2018]. Furthermore, a majority of existing literature has been focusing on achieving group fairness, whereas individual fairness considerations for rankings remain scarce—with few exceptions, such as work by Biega et al. [2018], where the authors introduce a mechanism to achieve individual fairness across a series of rankings. To the best of our knowledge, no existing work on fair ranking is closely related to ours in terms of methodology.

Perhaps the most related work is an article by Wang and Gupta [2020]. Here, the authors put forward the idea of optimizing classification accuracy subject to a classifier being monotonic in a given set of features. Thereby, it is argued, the classifier can evade violating “common deontological ethical principles and social norms such as [. . . ] ‘do not penalize good attributes’.” While the idea of enforcing monotonicity constraints is similar, we have identified three major differences to our work:

1. Wang and Gupta [2020] use supervised machine learning to predict ground-truth labels, whereas we assume imperfect labels. Our proposal in the case of imperfect labels is to not maximize for accuracy in the first place.
2. They do not take measures to prevent the algorithm from “exploiting” protected information to achieve higher accuracy.
3. They do not account for the well-known problem of indirect discrimination, which occurs when (seemingly) legitimate features are highly correlated with protected features.
In this section, we introduce our proposed ranking algorithm for decision-making with imperfectly labeled data, that is, the common case where ground-truth labels are not available. Specifically, we assume we are given data D with imperfect labels stemming from human-made decisions, for instance whether an applicant was admitted to graduate school or not.

Our approach follows a notion of individual fairness that aims at uniting both fairness definitions from Section 2.1, FTU and FTA. Note that this idea is closely related to the concept of “meritocratic fairness”, as coined by Kearns et al. [2017]. We define it as follows:

Definition 3 (Meritocratic fairness). An algorithm is fair if it assigns the positive outcome to the most qualified observations, regardless of protected features.
This definition is in line with many equal employment opportunity policies, yet disregarding affirmative action. Based on Definition 3, we can also define meritocratic unfairness:

Definition 4 (Meritocratic unfairness). An individual observation (i) is treated unfairly over a different observation (j) if (i) is more qualified than (j) but: a) ranked lower, or b) assigned (−) while (j) is assigned (+).

We will use Definition 4 for evaluation purposes later on. However, a definition of what qualified means can hardly be given without knowledge of the respective use case—a viewpoint that is shared, among others, by Dwork et al. [2012]. We will address this now.

To illustrate our ideas, we construct a (simplified) synthetic graduate school admission data set and use it as a running example—an excerpt is shown in Table 1. The data set consists of 1,000 observations, with 50% being males and 50% females (protected feature).

Table 1: Exemplary graduate school admission data.

ID    Gender    GRE V    GRE Q    GRE AW    Y
1     male      147      144      3.0       (+)
2     male      146      140      3.5       (+)
...   ...       ...      ...      ...       ...
11    female    153      147      3.5       (−)
...   ...       ...      ...      ...       ...

The GRE scores (legitimate features) are three-dimensional: GRE V(erbal Reasoning),
GRE Q(uantitative Reasoning), and GRE A(nalytical) W(riting). We sample them from multivariate Gaussian distributions N(µ_m(ale), Σ_m) and N(µ_f(emale), Σ_f), where the means µ_m, µ_f and the covariance matrices Σ_m, Σ_f are derived from the official data provided by the administrator of the GRE test [ETS, 2019a,b]. For compliance with the official ranges of scores, we round and truncate the sampled scores such that a) GRE V and GRE Q scores are between 130 and 170 in one-point increments, and b) GRE AW scores are between 0 and 6 in half-point increments.

To simulate historical admission decisions, we scale the legitimate features between 0 and 1 and generate imperfect labels by assigning (+) whenever a weighted sum of the indicator male and the scaled GRE V, GRE Q, and GRE AW scores, plus uniform noise ε, exceeds a fixed threshold, and (−) otherwise. Note that male applicants are given an unfair advantage over their female counterparts. Apart from this, the high importance of GRE Q scores could be representative of a technical university’s admission process.

We assume that for any specific use case, we are given (e.g., by an expert) or can easily derive the information of how any legitimate and relevant feature should impact the final decision—specifically, whether higher (↑) or lower (↓) values are beneficial with respect to the positive outcome. Note that if certain feature interactions have a known monotonic relationship with the outcome, then these interactions can be added as additional features and assigned a (↑) or (↓) as well. Certainly, in many cases these dependencies are obvious and need not be verified by an expert. For instance, in our graduate school example, it is clear that high GRE scores are more beneficial towards being admitted than low scores. Alternatively, if obtaining this information from an expert is too expensive, we could potentially infer the (↑) or (↓) relationships from D. The idea that relevant features should have a monotonic relationship with the outcome is, for instance, similarly introduced by Wang and Gupta [2020].

With this information, we first scale the legitimate features X such that all values are in [0, 1]. We call the scaled legitimate features z_ℓ ∈ Z, ℓ ∈ {1, …, L}. We further require that the probability of the positive outcome increases with the value of any z_ℓ. For that, we perform z_ℓ ← (1 − z_ℓ) if the original relationship between x_ℓ and the outcome is (↓). These steps are summarized in Algorithm 1. Note that we can also apply Algorithm 1 to observations that are not contained in D—in that case, we need to assume that the resulting values of z are capped at 0 and 1.
Algorithm 1: Scaling of legitimate features.
Input: Legitimate features x_1, …, x_L of D, including (↑) or (↓) relationships.
Output: Scaled features z_1, …, z_L.
for ℓ ∈ {1, …, L} do
    z_ℓ^{(i)} ← (x_ℓ^{(i)} − min_j x_ℓ^{(j)}) / (max_j x_ℓ^{(j)} − min_j x_ℓ^{(j)}) for all i ∈ {1, …, N};
    if x_ℓ is (↓) then
        z_ℓ ← (1 − z_ℓ);
    end
end
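For illustration, a minimal Python sketch of the scaling step in Algorithm 1 could look as follows; the function name and the NumPy-based implementation are our own choices and not part of the original formulation:

import numpy as np

def scale_legitimate_features(X, directions):
    # Min-max scale each legitimate feature to [0, 1] and flip features with a
    # decreasing relationship so that larger values are always better.
    #   X:          array of shape (N, L) with legitimate features
    #   directions: list of length L with 'up' (higher is better) or 'down'
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    Z = (X - mins) / (maxs - mins)              # per-feature min-max scaling
    for l, d in enumerate(directions):
        if d == 'down':                         # (↓) relationship: flip the scale
            Z[:, l] = 1.0 - Z[:, l]
    return np.clip(Z, 0.0, 1.0)                 # unseen observations are capped at [0, 1]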
3.1 Measuring distance to the North Star

Our idea is to fairly rank observations based on their distance to what we call the North Star.

Definition 5 (North Star). Given a data set D and the respective legitimate features z_ℓ, ℓ ∈ {1, …, L}, scaled as in Algorithm 1, the North Star is a (potentially hypothetical) observation (⋆) that attains the maximum observed value for each legitimate feature: z_ℓ^{(⋆)} := max_{i ∈ {1, …, N}} z_ℓ^{(i)} = 1 ∀ ℓ ∈ {1, …, L}.
Now, we can compute the distance of single observations to the North Star. For that, we choose the taxicab metric for its clear interpretation and its favorable behavior in higher dimensions [Aggarwal et al., 2001]. Note that the approach also works for other metrics. We define the distance of an observation (i) to the North Star as follows:

d(i, ⋆) := Σ_{ℓ=1}^{L} |1 − z_ℓ^{(i)}| = Σ_{ℓ=1}^{L} (1 − z_ℓ^{(i)}),    (1)

considering that z_ℓ^{(i)} ∈ [0, 1] for all ℓ and observations (i). Note that we assume symmetry of our distance measure, that is, d(i, ⋆) = d(⋆, i). In our example, this distance would be 0 for applicants with the perfect scores of GRE V = 170, GRE Q = 170, and GRE AW = 6.0.

In a next step, we enhance the distance formula in Equation (1) with useful information from historical data. Despite the fact that our given data D contains only imperfect labels, we argue that historical decisions often contain some useful information on the true labels. Specifically, we aim to extract the relative importance of legitimate features from historical decisions, assuming that important features from the past are still important at present. In our running example, for instance, we know that labels are biased, but we still want to capture that GRE Q scores are most important for admission at a technical university.

Our rationale is the following: While Equation (1) implicitly treats every feature as having equal importance, we want to account for the fact that some features are undoubtedly more important than others for decision-making. Even though we could explicitly ask experts for this information, similar to the (↑) or (↓) relationships, we argue that manually quantifying the importance of individual features (in %) is often intractable. We, therefore, propose to learn these importances directly from D, for instance through the concept of permutation importance [Breiman, 2001]—which is defined as the decrease in model score when the value of this respective feature is randomly permuted. The result is an estimate of how much a given model depends on this feature. For that, we train a classifier on D and obtain feature importances ω_1, …, ω_L, with ω_1, …, ω_L ≥ 0 and Σ_{ℓ=1}^{L} ω_ℓ = 1 for all legitimate features. (It might happen that the legitimate features cannot predict the historical labels reasonably well, e.g., if labels are random. In such cases we can skip this step.) In a next step, we can now adjust Equation (1) by adding ω_1, …, ω_L as weights to reflect the importance of each legitimate feature:

d′(i, ⋆) := Σ_{ℓ=1}^{L} ω_ℓ (1 − z_ℓ^{(i)}).    (2)

In the running example, by fitting a random forest classifier with bootstrapping and using the permutation_importance function of scikit-learn [Pedregosa et al., 2011], we obtain feature importances ω_V, ω_Q, and ω_AW (including standard deviations), with ω_Q clearly the largest. This means that d′ would be most sensitive to changes in GRE Q scores, as desired.
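To illustrate this step, a minimal sketch using scikit-learn’s permutation_importance could look as follows; the function name, the clipping of negative importances, and the normalization so that the weights sum to one are our own choices:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def learn_feature_weights(Z, y, n_repeats=10, random_state=0):
    # Estimate normalized importances ω_1, ..., ω_L of the scaled legitimate
    # features Z with respect to the historical (imperfect) labels y.
    clf = RandomForestClassifier(bootstrap=True, random_state=random_state).fit(Z, y)
    result = permutation_importance(clf, Z, y, n_repeats=n_repeats, random_state=random_state)
    raw = np.clip(result.importances_mean, 0.0, None)   # enforce ω_ℓ ≥ 0
    return raw / raw.sum()                              # enforce Σ_ℓ ω_ℓ = 1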
As outlined in Section 2.1, as well as by Hardt et al. [2016] and Pedreshi et al. [2008], the fundamental weakness of FTU as a notion of fairness is the fact that protected features can sometimes be predicted from legitimate features. It is particularly problematic if legitimate features are highly correlated with protected features—we account for this by penalizing high correlation.

To measure general monotonic relationships between two data samples, we use Spearman’s rank correlation coefficient (SRCC). For the rankings rk_a and rk_x of two samples a and x, the SRCC ρ_{a,x} is calculated as follows:

ρ_{a,x} = cov(rk_a, rk_x) / (σ_{rk_a} σ_{rk_x}),    (3)

where cov is the covariance and σ the standard deviation. An SRCC of ±1 occurs if one sample is a perfect monotonic function of the other. Note that SRCC also works for ordinal features, such as gender, race (ethnicity) and other protected features—this is important for our work.

Using Equation (3), we compute:

ρ̃_ℓ := max_{k ∈ {1, …, K}} {|ρ_{a_k, z_ℓ}|}   ∀ ℓ ∈ {1, …, L},    (4)

as the maximum absolute rank correlation between a given legitimate feature z_ℓ and any protected feature a_k. Our intuition behind taking the maximum over, for instance, the sum is that we do not want to penalize having many low individual absolute correlations—but rather scenarios where a seemingly legitimate feature is a (potentially noisy) proxy for one of the protected features. We can then use ρ̃_ℓ to further adjust the distance measure from Equation (2):

d″(i, ⋆) := Σ_{ℓ=1}^{L} ω_ℓ (1 − ρ̃_ℓ) (1 − z_ℓ^{(i)}),    (5)

where high values of ρ̃_ℓ reduce the importance of z_ℓ on the distance d″. Note that in the extreme case of ρ̃_ℓ = 1, the distance d″ will be independent of feature z_ℓ. This is desirable as it renders ineffective the possibility of introducing proxies for protected features under seemingly innocuous names.

For our graduate school admission example, we calculate the SRCC values using the spearmanr function of SciPy [Virtanen et al., 2020]. Note that we only have one protected feature: gender. The absolute correlations are ρ̃_V = 0.035, ρ̃_Q = 0.262, and ρ̃_AW = 0.167. While these values are not strikingly high, we may infer that there is a stronger relationship between gender and GRE Q than with GRE V or GRE AW. The importance of GRE Q for admission is thus reduced by 26.2%, as opposed to 3.5% and 16.7% for GRE V and GRE AW, respectively. In general, even if a seemingly legitimate feature was highly important for past decisions, its importance will vanish if it is highly correlated with a protected feature, as desired.

Coming back to Definition 4, we now define what being more qualified could mean:

Definition 6 (Higher qualification). We call an observation (i) more qualified than (j) if, according to the (↑) or (↓) relationships between features and positive outcome, (i) is better than or equal to (j) for all legitimate features and strictly better for at least one ℓ′ ∈ {1, …, L} with ω_{ℓ′} ≠ 0 and ρ̃_{ℓ′} ≠ 1.

Note that being more qualified is a stronger requirement than observation (i) having a shorter distance to the North Star than (j), that is, being more qualified implies a shorter distance to the North Star. The converse is not generally true.
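To make the weighted distance concrete, a minimal sketch (assuming NumPy arrays, ordinal-encoded protected features, and the hypothetical weight-learning helper above) could be:

import numpy as np
from scipy.stats import spearmanr

def north_star_distances(Z, A, weights):
    # Weighted taxicab distance d''(i, ⋆) of each observation to the North Star.
    #   Z:       (N, L) scaled legitimate features in [0, 1]
    #   A:       (N, K) protected features (ordinal-encoded)
    #   weights: (L,) learned importances ω_1, ..., ω_L
    N, L = Z.shape
    K = A.shape[1]
    # ρ̃_ℓ: maximum absolute Spearman rank correlation with any protected feature, Eq. (4)
    rho = np.array([max(abs(spearmanr(A[:, k], Z[:, l])[0]) for k in range(K))
                    for l in range(L)])
    psi = weights * (1.0 - rho)               # effective per-feature weights ψ_ℓ = ω_ℓ (1 − ρ̃_ℓ)
    return ((1.0 - Z) * psi).sum(axis=1)      # Eq. (5): d''(i, ⋆) = Σ_ℓ ψ_ℓ (1 − z_ℓ^{(i)})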
This implication is formally stated in the following proposition:

Proposition 1. If, according to Definition 6, an observation (i) is more qualified than observation (j), then d″(i, ⋆) is strictly smaller than d″(j, ⋆), where d″ is defined as in Equation (5).

Proof. Assume (i) is more qualified than (j), and w.l.o.g. assume that all legitimate features are scaled as in Algorithm 1. Then we have: z_ℓ^{(i)} ≥ z_ℓ^{(j)} ∀ ℓ ∈ {1, …, L} and ∃ ℓ′ : z_{ℓ′}^{(i)} > z_{ℓ′}^{(j)}. With ψ_ℓ := ω_ℓ (1 − ρ̃_ℓ) ∈ [0, 1] and ψ_{ℓ′} ≠ 0, we then obtain:

d″(i, ⋆) = Σ_{ℓ=1}^{L} ψ_ℓ (1 − z_ℓ^{(i)}) = Σ_{ℓ=1}^{L} ψ_ℓ − (ψ_1 z_1^{(i)} + ⋯ + ψ_{ℓ′} z_{ℓ′}^{(i)} + ⋯ + ψ_L z_L^{(i)})
        < Σ_{ℓ=1}^{L} ψ_ℓ − (ψ_1 z_1^{(j)} + ⋯ + ψ_{ℓ′} z_{ℓ′}^{(j)} + ⋯ + ψ_L z_L^{(j)}) = d″(j, ⋆),

since ψ_{ℓ′} z_{ℓ′}^{(i)} > ψ_{ℓ′} z_{ℓ′}^{(j)} and ψ_ℓ z_ℓ^{(i)} ≥ ψ_ℓ z_ℓ^{(j)} for all other ℓ ∈ {1, …, L} \ {ℓ′}.

Note that in Table 1, according to Definitions 4 and 6, observation 11 is treated unfairly over both observations 1 and 2. We will show that this can not happen with our method.

In this section, we summarize the previous findings and formalize our idea of a fair ranking-based classification algorithm. The proposed method can: a) fairly rank a given set of observations, b) propose new labels for the given observations, and c) rank and classify previously unseen observations. For a), we compute d″ for all observations and rank them by distance. After ranking, we reset the indices such that observation (1) has the smallest distance to the North Star and (N) the largest. For b) and c), we need to define a capacity threshold α ∈ (0, 1). This could be, for instance, a given admission rate. Alternatively, we can set α to the share of positive outcomes within D. Knowing α, we can then determine the cutoff point ν := ⌈αN⌉, such that the top-ν observations are assigned the positive outcome (+) and the rest is assigned the negative outcome (−). To infer a predictor, we compute:

δ := (d″(ν, ⋆) + d″(ν + 1, ⋆)) / 2,

as the average distance of observations (ν) and (ν + 1) to the North Star. Note that (ν) is the last observation with positive outcome, and (ν + 1) is the first observation with negative outcome.

Ultimately, to classify a previously unseen observation (u), we need to scale its legitimate features according to Algorithm 1—using the min and max feature values as observed in D—and measure the distance d″(u, ⋆) to the North Star. The inferred predictor would then be:

Ŷ(u) = (+) if d″(u, ⋆) ≤ δ, and (−) otherwise.

The proposed method is summarized in Algorithm 2.
Algorithm 2: Fair ranking-based classification algorithm.
Input: Data set D; (↑) or (↓) relationships for legitimate features; threshold α.
Output: Ranked and classified observations (1), …, (N); predictor Ŷ.
Compute Z as in Algorithm 1;
Set z_ℓ^{(⋆)} ← 1 ∀ ℓ ∈ {1, …, L};
Obtain ω_1, …, ω_L from learned classifier;
for ℓ ∈ {1, …, L} do
    ρ̃_ℓ ← max_{k ∈ {1, …, K}} {|ρ_{a_k, z_ℓ}|} as in Equation (4);
end
for i ∈ {1, …, N} do
    d″(i, ⋆) ← Σ_{ℓ=1}^{L} ω_ℓ (1 − ρ̃_ℓ) (1 − z_ℓ^{(i)}) as in Equation (5);
    Assign observation (i) the distance d″(i, ⋆);
end
Rank observations by distance d″ and reset indices such that (1) has smallest distance;
Define cutoff point ν ← ⌈αN⌉;
Assign (+) to top-ν observations and (−) to rest;
Define δ ← (d″(ν, ⋆) + d″(ν + 1, ⋆)) / 2;
if d″(u, ⋆) ≤ δ for a (potentially unseen) scaled observation (u) then
    Ŷ(u) ← (+);
else
    Ŷ(u) ← (−);
end
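A compact end-to-end sketch of Algorithm 2, reusing the hypothetical helper functions from the previous sketches, could look as follows (tie handling and the assumption ν < N are our own simplifications):

import numpy as np

def rank_and_classify(Z, A, y_hist, alpha):
    # Rank observations by d''(·, ⋆), assign labels to the top-ν, and derive the
    # threshold δ for classifying (potentially unseen) scaled observations.
    weights = learn_feature_weights(Z, y_hist)          # ω_1, ..., ω_L
    d = north_star_distances(Z, A, weights)             # d''(i, ⋆) for all i
    order = np.argsort(d)                               # ranking: smallest distance first
    nu = int(np.ceil(alpha * len(d)))                   # cutoff point ν = ⌈αN⌉
    labels = np.full(len(d), '-', dtype=object)
    labels[order[:nu]] = '+'                            # top-ν observations receive (+)
    d_sorted = np.sort(d)
    delta = (d_sorted[nu - 1] + d_sorted[nu]) / 2.0     # δ: average of ν-th and (ν+1)-th distance
    predict = lambda d_new: '+' if d_new <= delta else '-'
    return order, labels, delta, predict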
Note that from Proposition 1, it follows that meritocratic unfairness can not occur with our method:

Corollary 1. Meritocratic unfairness, as stated in Definitions 4 and 6, can not occur if observations are ranked and classified as in Algorithm 2.

Proof.
From Proposition 1, we conclude that if (i) is more qualified than (j), then d″(i, ⋆) will be strictly smaller than d″(j, ⋆). But by construction of the ranking in Algorithm 2, we will then have (i) ranked higher than (j), which also implies that if (j) is assigned (+), then (i) as well.

As explained in Section 2, FTA [Dwork et al., 2012] is one of the most prominent concepts of individual fairness, which is often verbalized as “treating similar individuals similarly.” However, it is often not immediately clear how to measure similarity of individuals. Algorithm 2 ranks observations based on their (weighted) distance to the North Star, d″(·, ⋆). Hence, by construction, if observations (i) and (j) have (relatively) similar distances to the North Star, then their rankings will be similar as well. Specifically, for observations (i), (j), (k), and rk(i) ≻ rk(j) ≻ rk(k), with rk denoting the ranking of an observation and “≻” being used as a shorthand for “higher than”, the following two inequalities will always hold:

d″(k, ⋆) − d″(i, ⋆) > d″(j, ⋆) − d″(i, ⋆),
d″(k, ⋆) − d″(i, ⋆) > d″(k, ⋆) − d″(j, ⋆).

However, having a similar distance to the North Star—hence, a similar ranking—does not imply that the respective observations are similar in a literal sense. For instance, in our running graduate school admission example, the two observations in Table 2 would have the same distance to the North Star, despite being fundamentally different in their feature values of GRE V and GRE AW. We argue that this is a desirable property, as it allows individuals with heterogeneous (but equally important/desirable) skill sets to achieve the positive outcome—at least to the extent that ω_ℓ and ρ̃_ℓ, ℓ ∈ {1, …, L}, allow.

Table 2: Two observations with equal distance to the North Star.

                   GRE V    GRE Q    GRE AW
Observation (i)    170      160      3.0
Observation (j)    140      160      6.0

On the other hand, it would be difficult to justify an algorithm that assigns significantly different outcomes to similar individuals. This is, in fact, the reasoning behind FTA. We will now show that our proposed method respects this requirement—precisely, that similar individuals are guaranteed to have similar distances to the North Star, and thus, similar rankings. But first, we define the similarity of two observations in terms of their weighted distance to each other in the feature space.

Definition 7 (Similarity of two observations). We measure the similarity of two observations (i) and (j) by their (weighted) taxicab distance to each other, similar to Equation (5):

d″(i, j) := Σ_{ℓ=1}^{L} ω_ℓ (1 − ρ̃_ℓ) · |z_ℓ^{(i)} − z_ℓ^{(j)}|.

Again, we have the symmetry d″(i, j) = d″(j, i). The following proposition now says that if two observations (i) and (j) are ε-similar, then the difference in their respective distances to the North Star will be bounded by ε.

Proposition 2.
If two observations (i) and (j) are ε-similar, that is, d″(i, j) = ε, ε ≥ 0, then the following holds: |d″(i, ⋆) − d″(j, ⋆)| ≤ ε.

Proof.
Let d″(i, j) = ε, and define ψ_ℓ := ω_ℓ (1 − ρ̃_ℓ). Then we have:

d″(i, j) = Σ_{ℓ=1}^{L} ψ_ℓ · |z_ℓ^{(i)} − z_ℓ^{(j)}| = ε.

And further, with ψ_ℓ ≥ 0 and the triangle inequality:

|d″(i, ⋆) − d″(j, ⋆)| = |Σ_{ℓ=1}^{L} ψ_ℓ (1 − z_ℓ^{(i)}) − Σ_{ℓ=1}^{L} ψ_ℓ (1 − z_ℓ^{(j)})| = |Σ_{ℓ=1}^{L} ψ_ℓ z_ℓ^{(i)} − Σ_{ℓ=1}^{L} ψ_ℓ z_ℓ^{(j)}| = |Σ_{ℓ=1}^{L} ψ_ℓ (z_ℓ^{(i)} − z_ℓ^{(j)})| ≤ Σ_{ℓ=1}^{L} ψ_ℓ · |z_ℓ^{(i)} − z_ℓ^{(j)}| = ε.

This shows the desired result.

Now, if we let ε become small, that is, ε → 0, then the observations and their respective distances to the North Star are becoming increasingly similar—and in the limit equal. Hence, those observations will be ranked adjacently, everything else unchanged.
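For completeness, the similarity measure of Definition 7 can be sketched in the same style, assuming the same effective weights ψ_ℓ = ω_ℓ (1 − ρ̃_ℓ) as in the earlier sketches:

import numpy as np

def pairwise_similarity_distance(z_i, z_j, psi):
    # Weighted taxicab distance d''(i, j) between two scaled observations (Definition 7).
    return float(np.sum(psi * np.abs(np.asarray(z_i) - np.asarray(z_j))))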
In this section, we instantiate our proposed method on the widely-used German Credit data set [Dua and Graff, 2017] from 1994. The data set is made up of 1,000 observations classified as good (70%) or bad (30%) credits (Y). As summarized by Pedreshi et al. [2008], it includes 20 features on a) personal belongings (e.g., checking account status, savings status, property), b) past/current credits and requested credit (e.g., credit history, credit request amount), c) employment status (e.g., job type, employment since), and d) personal attributes (e.g., personal status and gender, age, foreign worker).

Pre-processing of data
From the original data set we exclude certain features from consideration as they either a) do not exhibit an obvious monotonic relationship with the outcome, or b) appear to be irrelevant for deciding whether to grant a loan or not (e.g., telephone). The remaining features are shown in Table 3. In accordance with the law, we further separate the remaining features into protected and legitimate features. We determine the relationships (↑ or ↓) as depicted in Table 3. For evaluation purposes later on, we randomly shuffle the data and set aside 200 observations for testing purposes, 150 of which are labeled as having good credit.

Table 3: Features of the German Credit data set after pre-processing.

Feature                      Description                    A or X    (↑) or (↓)    ω_ℓ    ρ̃_ℓ
personal status and gender   marital status and gender      A         –             –      –
age                          age of person                  A         –             –      –
foreign worker               foreign worker yes/no          A         –             –      –
checking account status      money in checking              X         (↑)
savings status               money in savings               X         (↑)
property                     value of property              X         (↑)
type of housing              free/rent/own                  X         (↑)
credit history               quality of credit history      X         (↑)
credit request amount        credit amount requested        X         (↓)
job type                     unempl./un-/skilled/mgmt.      X         (↑)
employment since             how long employed              X         (↑)

Experimental setup
First, we scale the legitimate features X as in Algorithm 1. Then, we fit a random forest classifier with bootstrapping to predict Y from X. We repeat this five times, and for each model, we randomly permute the features ten times—this results in 50 estimates of importance for each legitimate feature. The average numbers (including standard deviations) are displayed as ω_ℓ in Table 3. Note that the values of feature importance are only meaningful if the underlying model predicts Y reasonably well. In our case, we obtain average accuracies of 79.4% (training) and 78.7% (testing). Following Algorithm 2, we next compute the maximum absolute rank correlations ρ̃_ℓ for each legitimate feature (see Table 3).

We conduct several experiments to rank 200 test observations and predict good or bad credit. To that end, we train a logistic regression classifier for the following scenarios: a) using all available features including protected features (LogReg_all), and b) omitting protected features (LogReg_FTU). Third, c) we apply our proposed method to the test observations.
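For reference, a minimal sketch of the two logistic regression baselines might look as follows; the function name and the probability-based ranking are our own framing of the setup described above:

from sklearn.linear_model import LogisticRegression

def logreg_ranking(X_train, y_train, X_test, drop_columns=None):
    # Rank test observations by predicted probability of the positive class.
    # Passing the indices of the protected features as drop_columns yields the FTU variant.
    if drop_columns:
        keep = [c for c in range(X_train.shape[1]) if c not in drop_columns]
        X_train, X_test = X_train[:, keep], X_test[:, keep]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]      # probability of the positive outcome
    return scores.argsort()[::-1]                 # ranking: highest probability first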
Evaluation criteria and results
To evaluate the results from scenarios a)–c), we first compare the rankings induced by the respective methods: For a) and b), we rank observations based on the prediction probabilities returned by the classifier, and for c), the ranking is obtained as in Algorithm 2. We measure fairness of a ranking by the number of unfairly treated observations, as specified in Definitions 4 and 6. In general, an observation can be treated unfairly over more than one other observation. For the baseline models, we therefore measure both the share S of individual observations that are treated unfairly and the total number T of instances where meritocratic unfairness occurs. The results are depicted in Table 4. Additionally, we also provide numbers on the meritocratic unfairness of the test labels—where we assume that observations with good credit are ranked higher than observations with bad credit. We note that LogReg_all produces both the highest S and the highest T, and LogReg_FTU performs only marginally better.

In reality, observations will primarily be affected by the actual outcome of the decision-making task—good or bad credit. Hence, we also compare scenarios a)–c) with respect to the predicted outcome, dependent on the choice of a threshold α. Specifically, we calculate S over a grid of α values between 0 and 1 for each scenario. Note that for the logistic regression models, the positive outcome (good credit) is assigned to the (100 · α)% observations with the highest prediction probabilities.
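The statistics S and T can be operationalized, under our reading of Definitions 4 and 6, roughly as sketched below (our own function and variable names; only the ranking clause of Definition 4 is checked):

import numpy as np

def meritocratic_unfairness(Z, ranks, psi):
    # Z:     (N, L) scaled legitimate features
    # ranks: (N,) rank of each observation (1 = best)
    # psi:   (L,) effective weights ψ_ℓ = ω_ℓ (1 − ρ̃_ℓ)
    N = len(ranks)
    relevant = psi > 0                       # features that can establish "strictly better"
    affected = np.zeros(N, dtype=bool)
    total = 0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            better_or_equal = np.all(Z[i] >= Z[j])
            strictly_better = np.any(Z[i, relevant] > Z[j, relevant])
            if better_or_equal and strictly_better and ranks[i] > ranks[j]:
                affected[i] = True           # i is more qualified than j but ranked lower
                total += 1
    return affected.mean(), total            # share S and total count T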
Table 4: Meritocratic unfairness and accuracy of different scenarios for the German Credit data set.

            LogReg_all    LogReg_FTU    Our Method    Test Labels
S                                       0.0%
T           616           596           0             222
Accuracy                  76.5%         56.0%         100%

From Figure 1, we conclude that, apart from the trivial cases of α = 0 and α = 1, both baseline models involve high percentages of unfairly treated observations—with LogReg_all reaching values of more than 50% for some intermediate values of α.
Figure 1: Share S of unfairly treated observations over α for different scenarios. Here, S is calculated based on the predicted labels, not the ranking.

For completeness, we also include the models’ accuracy with respect to the test labels in Table 4 (we set α = 0.75 to ensure comparability with the test labels). However, note that accuracy is measured based on an imperfect and potentially biased proxy (i.e., the test labels) of the ground-truth labels regarding the qualification of individuals. Hence, a drop in accuracy, as observed for our method in Table 4, may be explained by a strong mismatch between available imperfect labels and true (but unavailable) labels. Unfortunately, we do not have a way to control the level of label bias in real-world data. For that reason, we conduct a series of experiments on synthetic data and present evidence that, in fact, our method’s accuracy tends to be a) similar to traditional classification models’ when label bias is low, and b) lower when label bias is high, implying that low accuracy and desirable outcomes need not contradict each other.
In order to better understand the previous results, we evaluate our method extensively on synthetic data with imperfect labels. To that end, we take a more in-depth look at the simplified graduate school admission data introduced in Section 3. Recall that we sampled the GRE scores from multivariate Gaussian distributions according to the gender-specific means and standard deviations provided by ETS [2019a,b]. Also, we included an equal amount of females and males, respectively, in the data set of overall 1,000 observations.
Experimental setup
For the purpose of evaluating our method, we simulate historical admission decisions/labels (e.g., of a technical university) first by computing a score R for each observation as the weighted sum of its (scaled) feature values:

R := ζ/(ζ + 4) · male + 1/(ζ + 4) · GRE V + 2/(ζ + 4) · GRE Q + 1/(ζ + 4) · GRE AW + ε,

with ζ ≥ 0 and zero-mean Gaussian noise ε, the latter of which might reflect the (unpredictable) mood of the admissions committee or other circumstances that affected admission decisions in the past. Note that the feature weights sum up to 1, and that the weights of GRE V and GRE AW are the same. Moreover, the influence of GRE Q on R is approximately twice as high as compared to the other GRE scores, in order to mimic a more quantitative-focused admission process. The positive outcome (+) is then initially assigned to observations with R above a fixed threshold, and (−) is assigned otherwise—this ensures a well-balanced label distribution. Yet, those generated labels are imperfect (i.e., not ground truth) because a) the score R is only a noisy signal of potential success in graduate school, b) the computation of R involves (simulated) human subjectivity and error, and c) R may be discriminatory, depending on the choice of ζ.

The parameter ζ lets us control the amount of direct discrimination [Mehrabi et al., 2019] in the decisions, as it directly increases R for males and decreases it for females. Besides, a large ζ could also be an indicator of indirect discrimination [Mehrabi et al., 2019], for instance if other features highly correlated with gender—and favoring males—were given strong weight in the (simulated) historical decisions. Note that as ζ becomes increasingly large, the direct influence of the legitimate features (GRE scores) on R vanishes.

We evaluate our method on seven synthetic data sets with varying levels of bias/discrimination in the labels, as controlled through ζ (see Table 5). Like in Section 4, we randomly set aside 200 observations for testing on each data set. Our method is implemented according to Algorithm 2, with the GRE features being legitimate, and gender being protected. Naturally, higher GRE scores should be more beneficial towards being admitted—hence (↑) relationships with the outcome. An overview of all ω_ℓ and ρ̃_ℓ values is given in Table 6. Note that ρ̃_ℓ is constant across the data sets, as changing ζ only affects the label distribution, not the correlations among features. We also like to highlight that feature importances (ω_ℓ) still capture well the policy that GRE Q should carry significantly more weight in the decision process than GRE V and GRE AW—even with relatively high levels of bias in the labels (ζ = 3).
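A minimal sketch of this label-generation process could look as follows; the noise scale and the median threshold are our own assumptions in place of the exact values used above:

import numpy as np

def simulate_admission_labels(male, gre_scaled, zeta, noise_scale=0.1, rng=None):
    # male:       (N,) 0/1 indicator; gre_scaled: (N, 3) scaled GRE V, GRE Q, GRE AW
    # zeta:       amount of direct discrimination in favor of males
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.array([1.0, 2.0, 1.0]) / (zeta + 4.0)             # weights for GRE V, Q, AW
    R = (zeta / (zeta + 4.0)) * np.asarray(male) + np.asarray(gre_scaled) @ w
    R = R + rng.normal(0.0, noise_scale, size=R.shape)       # simulated committee noise (scale assumed)
    threshold = np.median(R)                                 # keeps labels roughly balanced (assumption)
    return np.where(R > threshold, '+', '-')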
Results and interpretation

As previously, we first compare the rankings of our method against the rankings induced by the logistic regression models LogReg_all and LogReg_FTU on the test data. The resulting levels of meritocratic unfairness (both S and T) are displayed in Table 5, including the statistics for the test labels. We also report accuracy as well as additional statistics regarding the admission of females and males for each scenario, based on label predictions. Specifically, we report the ratio of admitted females to males and compare the admission rates per gender. For label predictions with our method, to ensure comparability, we set α equal to the share of admitted applicants in the test set.

Figure 2: Share S of unfairly treated observations over ζ for different scenarios.
Figure 3: Accuracy over ζ for different scenarios.
Figure 4: Admission ratio of females to males over ζ for different scenarios.

From Table 5 and Figures 2–4, we infer several observations: First, as ζ increases, the admission ratio of females to males in the data sets decreases, as expected, whereas the overall admission rate remains stable (between 54% and 60% in the test labels). The fact that increasing ζ results in the labels depending more strongly on the value of gender is exploited by LogReg_all to discriminate observations based thereon. Not surprisingly, as ζ increases, LogReg_all clings to the trajectory of the test labels both for accuracy as well as meritocratic unfairness and demographic parity (with respect to admission ratios), making its predictions accurate but blatantly unfair—both meritocratically and with respect to admission rates by gender. We further observe in Figure 2 that the gender-agnostic LogReg_FTU model removes meritocratic unfairness when label bias is low. However, as the level of discrimination in the data increases, it fails to remove such unfairness, with S surging and even surpassing the levels of LogReg_all and the test labels for growing ζ. These problems do not occur with our method, which always achieves zero meritocratic unfairness. Additionally, as can be seen in Figure 4, our experiments suggest that enforcing individual meritocratic fairness results in higher group fairness (here: demographic parity with respect to admission rates) as well: While the logistic regression models both exhibit a negative relationship between ζ and demographic parity, our method satisfies a constant high level of group fairness for any ζ, similar to the one of the test labels without explicit discrimination (ζ = 0). Note that the converse—group fairness implying individual meritocratic fairness—is not generally true, for instance if a model randomly admits an equal share of females and males without paying any attention to their qualification.
Table 5: Meritocratic unfairness, accuracy, and admission statistics on synthetic data with varying levels of discrimination ζ.

                            LogReg_all    LogReg_FTU    Our Method    Test Labels
ζ = 0.0
  S                                       0.0%          0.0%
  T                         45            0             0             177
  Accuracy                                83.5%
  Admission Female/Male     0.62          0.72
  Admission Rate Female                   59.8%
  Admission Rate Male                     70.4%

ζ = 0.5
  S                                       0.0%          0.0%
  T                         503           0             0             230
  Accuracy                                77.5%
  Admission Female/Male     0.33          0.66
  Admission Rate Female                   53.3%
  Admission Rate Male                     68.5%

ζ = 1.0
  S                                       0.0%          0.0%
  T                         1,200         0             0             608
  Accuracy                                71.0%         73%
  Admission Female/Male     0.13          0.62
  Admission Rate Female                   48.9%
  Admission Rate Male                     67.6%

ζ = 1.5
  S                                                     0.0%
  T                         1,571         48            0             797
  Accuracy                                68.5%
  Admission Female/Male     0.03          0.55
  Admission Rate Female                   41.3%
  Admission Rate Male                     63.9%

ζ = 2.0
  S                                                     0.0%
  T                         1,630         324           0             955
  Accuracy                                68.0%
  Admission Female/Male     0.01          0.52
  Admission Rate Female                   41.3%
  Admission Rate Male                     67.6%

ζ = 2.5
  S                                                     0.0%
  T                         1,631         731           0
  Accuracy                                68.0%
  Admission Female/Male     0.00          0.47
  Admission Rate Female                   37.0%
  Admission Rate Male                     66.7%

ζ = 3.0
  S                                                     0.0%
  T                         1,631         1,364         0
  Accuracy                                69.0%
  Admission Female/Male     0.00          0.42
  Admission Rate Female                   32.6%
  Admission Rate Male                     65.7%
Table 6: Overview of seven synthetic data sets with varying levels of discrimination ζ.

Feature   A or X   (↑) or (↓)   ω_ℓ (ζ=0.0)   ω_ℓ (ζ=0.5)   ω_ℓ (ζ=1.0)   ω_ℓ (ζ=1.5)   ω_ℓ (ζ=2.0)   ω_ℓ (ζ=2.5)   ω_ℓ (ζ=3.0)   ρ̃_ℓ
gender    A        –            –             –             –             –             –             –             –             –
GRE V     X        (↑)          0.26          0.21          0.22          0.21          0.22          0.24          0.27
GRE Q     X        (↑)          0.66          0.71          0.72          0.70          0.63          0.56          0.49
GRE AW    X        (↑)          0.08          0.08          0.07          0.09          0.16          0.20          0.24

In this paper, we present a practical and easy-to-implement approach for fair ranking and binary classification, as an alternative to traditional classification algorithms. Given the common setting of data with (potentially biased) imperfect labels, our method ranks observations according to their qualification for a specific outcome, for instance admission to graduate school, regardless of protected features like gender or race. Instead of learning to predict imperfect labels, we introduce an idea to incorporate useful information from historical decisions in our decision criterion. Additionally, we account for unwanted dependencies between (seemingly) legitimate and protected features. We show theoretically that our method respects a version of the prominent concept of fairness through awareness. Experiments on synthetic and real-world data confirm that our method yields desirable results both with respect to meritocratic fairness and group fairness (e.g., similar admission rates for females and males), clearly outperforming traditional classification algorithms trained on data with biased/imperfect labels. Our work allows for several directions of follow-up research: For instance, it would be interesting to elaborate more on how to meaningfully include features that do not exhibit obvious monotonic relationships with the outcome. Another natural extension of our method could involve a more sophisticated way of accounting for feature interactions. Ultimately, we hope that our work will equip (especially) practitioners with helpful new tools for designing equitable decision systems.
References

116th Congress. Equality Act, 2019. Last accessed on Feb 8, 2021.
Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer, 2001.
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, May 23, 2016.
Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 405–414, 2018.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3992–4001, 2017.
Claude Castelluccia and Daniel Le Métayer. Understanding algorithmic decision-making: Opportunities and challenges. European Parliament, 2019.
Carlos Castillo. Fairness and transparency in ranking. In ACM SIGIR Forum, volume 52, pages 64–71. ACM, 2019.
L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. Ranking with fairness constraints. In International Colloquium on Automata, Languages, and Programming (ICALP). Schloss Dagstuhl — Leibniz Center for Informatics, 2018.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806, 2017.
Dheeru Dua and Casey Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. Last accessed on Feb 8, 2021.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.
ETS. A snapshot of the individuals who took the GRE General Test, 2019a. Last accessed on Feb 8, 2021.
ETS. GRE guide to the use of scores 2019–20, 2019b. Last accessed on Feb 8, 2021.
Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P Gummadi, and Adrian Weller. The case for process fairness in learning: Feature selection for fair decision making. In NIPS Symposium on Machine Learning and the Law, volume 1, page 2, 2016.
Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
Lily Hu and Issa Kohler-Hausmann. What’s sex got to do with machine learning? In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 513–513, 2020.
Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2nd International Conference on Computer, Control and Communication, pages 1–6. IEEE, 2009.
Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.
Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2012.
Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In International Conference on Machine Learning, pages 1828–1836, 2017.
Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, et al. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.
Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277–287, 2020.
Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284, 2017.
Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
Shahien Nasiripour and Sridhar Natarajan. Apple co-founder says Goldman’s Apple card algorithm discriminates. Bloomberg, 2019.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568, 2008.
Ashudeep Singh and Thorsten Joachims. Equality of opportunity in rankings. In NeurIPS Workshop on Prioritising Online Content, page 31, 2017.
Ashudeep Singh and Thorsten Joachims. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2219–2228, 2018.
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
Serena Wang and Maya Gupta. Deontological ethics by monotonicity shape constraints. In International Conference on Artificial Intelligence and Statistics, pages 2043–2054. PMLR, 2020.
Ke Yang and Julia Stoyanovich. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pages 1–6, 2017.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017a.
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, pages 962–970. PMLR, 2017b.
Meike Zehlike and Carlos Castillo. Reducing disparate exposure in ranking: A learning to rank approach. In Proceedings of The Web Conference 2020, pages 2849–2855, 2020.
Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, et al. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, pages 1569–1578, 2017.
Indrė Žliobaitė. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148, 2015.