Everything is Relative: Understanding Fairness with Optimal Transport
Kweku Kwegyir-Aggrey¹, Rebecca Santorella², and Sarah M. Brown³
¹Department of Computer Science, Brown University
²Division of Applied Mathematics, Brown University
³Department of Computer Science, University of Rhode Island
Abstract
To study discrimination in automated decision-making systems, scholars have proposed several definitions of fairness, each expressing a different fair ideal. These definitions require practitioners to make complex decisions regarding which notion to employ and are often difficult to use in practice since they make a binary judgement (a system is fair or unfair) instead of explaining the structure of the detected unfairness. We present an optimal transport-based approach to fairness that offers an interpretable and quantifiable exploration of bias and its structure by comparing a pair of outcomes to one another. In this work, we use the optimal transport map to examine individual, subgroup, and group fairness. Our framework is able to recover well-known examples of algorithmic discrimination, detect unfairness when other metrics fail, and explore recourse opportunities.
1 Introduction

Machine learning-based decisions in sensitive domains such as criminal justice Lum and Isaac [2016] or banking Munkhdalai et al. [2019] have come under significant scrutiny due to their ability to amplify, re-invent, and create patterns of discriminatory behavior against already marginalized and vulnerable groups. Many scholars have attempted to prevent algorithmic discrimination by engineering fair algorithmic solutions. These efforts Zafar et al. [2017], Hardt et al. [2016], Woodworth et al. [2017] rely on the assumption that if it can be proven or demonstrated that a classifier exhibits fair behavior with respect to some agreeable notion of fairness, this classifier will no longer produce discriminatory outcomes. In contrast, critical scholars instead assert that discrimination by algorithmic systems is a symptom of broader systematic and historic forms of oppression Selbst et al. [2019], which cannot be ameliorated through fair algorithmic solutionism. We suggest that the key tension between these viewpoints relies on resolving the following question:
If machine learning can be used to examine the discrimination present in algorithmic solutions, how can bias be meaningfully quantified and studied within this context to prevent further discriminatory outcomes? In this framing, we view bias as a measurement of discrimination relative to some fair ideal.

Despite countless examples of bias in automated decision making, a precise definition of fairness in this context remains elusive. Philosophical and legal doctrine have proposed several conceptualizations of fairness such as disparate impact and equal opportunity, but no single fair notion has become standard or ubiquitous. Instead, each fairness notion exists to describe idealized fair behavior in a different context or application. The machine learning community approaches fairness similarly, usually by translating some abstract notion into a mathematical expression and then declaring classifiers as fair if they satisfy said expression. We consider two types of fair criteria: individual fairness Dwork et al. [2012] and group fairness. The first postulates that similar individuals should be treated similarly, regardless of protected group membership such as race or gender, while the second considers the context of these protected attributes and looks to achieve some notion of parity between groups. For example, the group fairness notion of demographic parity suggests that fairness is achieved when a classifier's predictions are independent of individuals' protected attributes.

Although these fair definitions are tractable and succinct, they often lack sufficient nuance to examine fairness in the context of real sociotechnical systems Hoffmann [2019]. For example, it has been shown that binary assumptions about protected group membership limit the applicability of fair algorithms Hanna et al. [2020]. While these assumptions make some fairness problems tractable, Kearns et al. [2018] demonstrated that they are not sufficient to compute fair outcomes over more complex groups such as a set of protected groups that may intersect (our intersectional understanding of identity is based in the notion of Intersectionality Theory Crenshaw [2017]). Moreover, these fairness definitions can suggest that a system is fair or unfair but are unable to provide actionable recourse as to how to amend the bias, further rendering these definitions deficient in practice Corbett-Davies and Goel [2018].

Our solution is to evoke an understanding of fairness that does not rely on closed-form definitions. Instead, we propose that a set of outcomes can only be deemed unfair if it deviates strongly from a set of outcomes that are deemed fair. In other words, fairness is a property that can be evaluated by comparing sets of outcomes. This intuition can be made tractable by viewing sets of outcomes as probability distributions and then leveraging some statistical difference measure to quantify the differences between outcomes.

Example 1.1. A university admissions process may be considered unfair if it admits cohorts that are not sufficiently representative or diverse. This complaint follows from comparing the admitted cohort to another, possibly hypothetical, fair cohort. This notion of fairness does not necessarily require parity in classification (not all student groups must be admitted with the same frequency), which differentiates this intuition from current fair algorithmic practices.

We advocate for the Wasserstein distance as the best statistical distance metric to compare outcome distributions. The Wasserstein distance has many advantages over other distribution metrics such as KL-Divergence or Total Variation distance (we refer the reader to Arjovsky et al. [2017] for a summary of these advantages); however, there are two advantages we would like to highlight. First, the Wasserstein distance can be computed between distributions that are not defined over the same supports. This flexibility means that we can compare unfairness across two seemingly disparate groups of observations. Additionally, computing the Wasserstein distance produces an optimal transport map, which describes how one outcome distribution is transformed into the other and so reveals the structure of any disparity.
To circumvent the limitations of the definition-based fairness approach, we provide an optimal transport-based framework for exploring and quantifying bias in automated systems. We compute the Wasserstein distance between outcomes for different groups to understand how one set of outcomes may be transformed into another. This distance is able to detect bias when other fairness metrics fail, and we can leverage the transport map used in the distance calculation to quantify individual, subgroup, and group bias. This approach allows us to recover biases previously found in the COMPAS and German credit datasets, as well as offer possible actions for recourse.
2 Background

In this section, we introduce key definitions in fairness and relevant mathematical background on optimal transport.
Allow $X \subseteq \mathbb{R}^k$ to be some set of features and $Y = \{0, 1\}$ to be a binary set of possible outcomes for some decision. We assume that $S = \{0, 1\}$ describes the protected group membership of an individual. A classifier is a function $h : X \to \{0, 1\}$ which outputs a classification decision $h(x)$ for each individual $x \in X$. We assume there exists a distribution $\mathcal{D}$ over the set $Z = X \times S \times Y$, which takes on values $(x, s, y) \sim \mathcal{D}$. Below we review common definitions of group fairness that we will use as points of comparison.

2.1 Group Fairness
Definition 2.1 (Demographic Parity). A classifier is fair under demographic parity if the classifier's predictions are independent from the sensitive attribute Calders et al. [2009]:
\[
\mathbb{E}_{\mathcal{D}}[h(X) \mid S = 1] = \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 0]. \tag{1}
\]
Chouldechova [2016] demonstrated that satisfying demographic parity requires each protected group to have the same distribution of features. Most data distributions violate this condition, and in that case, enforcing demographic parity can have an adverse effect on classifier performance Zafar et al. [2017], Lohaus et al. [2020], Hu and Chen [2020]. To avoid this performance loss, the difference of demographic parity (DDP) Wu et al. [2019] relaxes the strict independence assumption to
\[
\mathrm{DDP}(h) := \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 1] - \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 0],
\]
but this relaxation still reduces fairness to a single summary number and does not provide additional information regarding the structure of observed unfairness. (In Section 4.1 we discuss disparate impact, which quantifies a lack of demographic parity.)

Equal opportunity Hardt et al. [2016] achieves fairness by equalizing the per-group true positive rates of a classifier.

Definition 2.2 (Equal Opportunity). A classifier is fair under equal opportunity if its predictions have the same true positive rates across sensitive attribute membership:
\[
\mathbb{E}_{\mathcal{D}}[h(X) \mid Y = 1, S = 1] = \mathbb{E}_{\mathcal{D}}[h(X) \mid Y = 1, S = 0].
\]
Although equal opportunity differs from demographic parity in that it can be enforced despite differences in group-conditioned feature distributions, the two expressions share many of the same limitations, such as the inability to examine the structure of unfairness. We refer the reader to Corbett-Davies and Goel [2018] for a more complete analysis of parity-based fairness definitions.
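As a concrete illustration (not from the paper), both parity gaps can be estimated directly from model predictions; the function names and data layout below are our own, given as a minimal sketch:

```python
import numpy as np

def ddp(y_pred, s):
    """Difference of demographic parity: E[h(X) | S=1] - E[h(X) | S=0]."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return y_pred[s == 1].mean() - y_pred[s == 0].mean()

def equal_opportunity_gap(y_pred, y_true, s):
    """Gap in true positive rates across groups (Definition 2.2):
    restrict to Y = 1 and compare mean predictions per group."""
    pos = np.asarray(y_true) == 1
    y_pred, s = np.asarray(y_pred)[pos], np.asarray(s)[pos]
    return y_pred[s == 1].mean() - y_pred[s == 0].mean()
```

Both functions return 0 for a classifier that exactly satisfies the corresponding definition; note that either gap is a single number with no information about where the disparity comes from.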
2.2 Individual Fairness

Individual fairness Dwork et al. [2012] is summarized by the intuition that "similar individuals should be treated similarly." Although this definition is semantically appealing, choosing a reasonable similarity metric is often task-specific, thus rendering this notion of fairness untenable across domains Chouldechova and Roth [2018]. Furthermore, there are many instances, such as Example 1.1, where fairness may require treating dissimilar individuals similarly, which is not captured by this definition.
2.3 Optimal Transport

To measure the difference between sets of outcomes, we use optimal transport, which seeks the most cost-effective way to transform one probability distribution into another Peyré et al. [2019]. If we think of distributions as probabilities on point masses, then optimal transport finds the minimal-effort way to move masses from one domain onto masses in another domain. The Kantorovich formulation allows the transport map to split masses between domains, meaning that the map does not need to be a bijection.

Formally, given two measures $\mu_0$ and $\mu_1$ defined on the measure spaces $\mathcal{A}$ and $\mathcal{B}$, respectively, the optimal transport problem seeks an optimal coupling $\pi$ from the set
\[
\Pi(\mu_0, \mu_1) = \{\pi \in \mathcal{P}(\mathcal{A} \times \mathcal{B}) : \pi(A \times \mathcal{B}) = \mu_0(A) \text{ for } A \subset \mathcal{A},\ \pi(\mathcal{A} \times B) = \mu_1(B) \text{ for } B \subset \mathcal{B}\}. \tag{2}
\]
The coupling $\pi$ is a new probability distribution over the product space $\mathcal{A} \times \mathcal{B}$ that assigns probabilities to relate masses from domain $\mathcal{A}$ to masses in domain $\mathcal{B}$. The constraints ensure that the marginals of $\pi$ match the original input distributions $\mu_0$ and $\mu_1$. The full optimal transport problem with constraints is given by
\[
\min_{\pi \in \Pi(\mu_0, \mu_1)} \int_{\mathcal{A} \times \mathcal{B}} c(a, b)\, d\pi(a, b), \tag{3}
\]
where $c : \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ is a cost function that measures how difficult it is to move mass from domain $\mathcal{A}$ to $\mathcal{B}$.

The $p$-th Wasserstein distance is a special case of this optimal transport problem on the metric measure space $(\mathcal{A}, d)$ with set $\mathcal{A}$ and distance $d : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ when the cost function is $c(a_0, a_1) = d(a_0, a_1)^p$:
\[
W_p(\mu_0, \mu_1) \equiv \left( \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int_{\mathcal{A} \times \mathcal{A}} d(a_0, a_1)^p\, d\pi(a_0, a_1) \right)^{1/p}. \tag{4}
\]
The Wasserstein distance is commonly used in machine learning to measure distances between probability distributions.

Given data $A = \{a_i\}_{i=1}^{n_0}$ from $\mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1}$ from $\mathcal{B}$, the marginal distributions can be expressed as
\[
m_0 = \sum_{i=1}^{n_0} p_i \delta_{a_i} \quad \text{and} \quad m_1 = \sum_{j=1}^{n_1} q_j \delta_{b_j}, \tag{5}
\]
where $\delta_a$ is the Dirac measure, and the set of admissible couplings are the matrices
\[
\Pi(m_0, m_1) = \{\pi \in \mathbb{R}_+^{n_0 \times n_1} : \pi \mathbf{1}_{n_1} = m_0,\ \pi^T \mathbf{1}_{n_0} = m_1\}. \tag{6}
\]
Intuitively, each entry $\pi(a_i, b_j)$ in a coupling matrix says how much mass from point $a_i$ in $A$ should be mapped onto point $b_j$ in $B$. Similarly, the cost can be described as a matrix $C \in \mathbb{R}^{n_0 \times n_1}$, where $C_{ij} = c(a_i, b_j)$ captures the cost of moving point $a_i$ onto point $b_j$. Using this setup, the discrete optimal transport problem is
\[
\min_{\pi \in \Pi(m_0, m_1)} \langle \pi, C \rangle, \tag{7}
\]
which can be solved with minimum cost flow solvers Bonneel et al. [2011].
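The discrete problem in Equation 7 can be solved with off-the-shelf tools. The sketch below uses the Python Optimal Transport (POT) toolbox, which the paper references in Section 3; the synthetic data here are placeholders for illustration only:

```python
import numpy as np
import ot  # Python Optimal Transport (https://pot.readthedocs.io/en/stable/)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 3))   # samples {a_i} from domain A
B = rng.normal(0.5, 1.0, size=(60, 3))   # samples {b_j} from domain B

# Uniform empirical marginals m_0 and m_1 (Eq. 5 with p_i = 1/n_0, q_j = 1/n_1)
m0 = np.full(A.shape[0], 1 / A.shape[0])
m1 = np.full(B.shape[0], 1 / B.shape[0])

# Cost matrix C_ij = ||a_i - b_j|| (Euclidean ground distance)
C = ot.dist(A, B, metric='euclidean')

# Solve min_{pi in Pi(m0, m1)} <pi, C> (Eq. 7) via a network-flow solver
pi = ot.emd(m0, m1, C)   # optimal coupling, shape (n_0, n_1)
w1 = np.sum(pi * C)      # 1-Wasserstein distance under this cost
```

The coupling `pi` is exactly the matrix constrained by Equation 6: its rows sum to `m0` and its columns sum to `m1`.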
3 Our Framework

In this section, we introduce our optimal transport-based framework and explain how we quantify bias in individuals, subgroups, and groups.

3.1 Comparing Policies

We are interested in comparing the differences between two decision-making policies with respect to a given set of outcomes, where outcomes are probability vectors over possible predicted labels. To capture more nuance in algorithmic decisions, we extend the concept of a classifier to a policy, which expresses outcomes (or predictions) as probabilities rather than binary labels.

Figure 1: Schematic of optimal transport applied to the school admissions example in Section 4.1. Policies are applied to the two populations to generate a transport map between their outcomes.
Definition 3.1 (Policy). A policy $\mathcal{F} : X \to \mathcal{O}(Y)$ is a mapping from the feature space to the space of outcomes.

Denote the distribution on outcomes generated by observing a policy $\mathcal{F}$ on the space $X$ by $\mu_{\mathcal{F}}^X$. Given policies $\mathcal{F}_0$ and $\mathcal{F}_1$ on subsets $A$ and $B$ of $X$, we use the Wasserstein distance to measure the differences in outcomes they produce:
\[
W(\mu_{\mathcal{F}_0}^A, \mu_{\mathcal{F}_1}^B). \tag{8}
\]
This parametrization provides control over what comparison the distance captures. By setting $A$ and $B$ to be the same population, we measure the impact of different policies. Alternatively, we can take $\mathcal{F}_0$ and $\mathcal{F}_1$ to be the same policy to study its effects on different populations.

In practice, we do not have access to the true distributions of the outcomes, so we approximate the outcome distributions with uniform distributions, as is standard in computational optimal transport Peyré et al. [2019], and compute $W(m_{\mathcal{F}_0}^A, m_{\mathcal{F}_1}^B)$:
\[
m_{\mathcal{F}_0}^A = \frac{1}{n_0} \sum_{i=1}^{n_0} \delta_{\mathcal{F}_0(a_i)} \quad \text{and} \quad m_{\mathcal{F}_1}^B = \frac{1}{n_1} \sum_{j=1}^{n_1} \delta_{\mathcal{F}_1(b_j)}. \tag{9}
\]
In the above empirical distributions, the outcomes are generated by applying the policies to data $A = \{a_i\}_{i=1}^{n_0} \sim \mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1} \sim \mathcal{B}$. In this case, the cost matrix is computed by
\[
C_{ij} = \|\mathcal{F}_0(a_i) - \mathcal{F}_1(b_j)\|. \tag{10}
\]
Figure 1 gives an overview of this process.

This computation reveals the magnitude of difference between the outcomes from policies $\mathcal{F}_0$ and $\mathcal{F}_1$. Allow $\pi$ to be the optimal coupling associated with the above Wasserstein distance. This coupling is a transport plan describing how to split the mass of outcomes from $\mathcal{F}_0$ onto outcomes from $\mathcal{F}_1$ to produce similar behavior (or vice-versa). In particular, the entries of the normalized row $n_0 \pi_i$ describe how to split the mass of outcome $\mathcal{F}_0(a_i)$ onto the outcomes $\{\mathcal{F}_1(b_j)\}_{j=1}^{n_1}$. Since the outcome vectors result from applying policies to individuals in the datasets, we can also trace back the policy mappings and interpret the transport map between individuals in the two sets. We implement this method with the Python Optimal Transport toolbox (https://pot.readthedocs.io/en/stable/).
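A minimal sketch of this step, under the assumption that each policy is represented by its per-individual outcome probabilities (for instance, the output of a scikit-learn model's `predict_proba`); the helper name is ours:

```python
import numpy as np
import ot

def outcome_coupling(F0_probs, F1_probs):
    """Couple two policies' empirical outcome distributions (Eqs. 8-10).

    F0_probs, F1_probs: per-individual outcome probabilities, one row per
    individual (a 1-D array of scores also works)."""
    F0 = np.asarray(F0_probs, dtype=float).reshape(len(F0_probs), -1)
    F1 = np.asarray(F1_probs, dtype=float).reshape(len(F1_probs), -1)
    n0, n1 = len(F0), len(F1)
    m0, m1 = np.full(n0, 1 / n0), np.full(n1, 1 / n1)  # uniform marginals (Eq. 9)
    C = ot.dist(F0, F1, metric='euclidean')  # C_ij = ||F0(a_i) - F1(b_j)|| (Eq. 10)
    pi = ot.emd(m0, m1, C)                   # optimal coupling over outcomes
    return pi, np.sum(pi * C)                # coupling and Wasserstein distance

# Row i of n0 * pi describes how outcome F0(a_i) splits across {F1(b_j)}.
```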
3.2 Measuring Individual and Group Bias

While the Wasserstein distance measures how different two sets of outcomes are, it does not take into account differences between the underlying populations to which each policy is applied. Thus, to compare two policies and examine their biases, we need to introduce metrics which take the feature space into account. Each outcome $\mathcal{F}(a)$ corresponds to some individual $a$, so we can think of the coupling as mapping individuals in $A$ to individuals in $B$ through their outcomes.

Allow $\Gamma(a, \pi)$ to be the set of all individuals that $a \in A$ is mapped to under the coupling $\pi$:
\[
\Gamma(a, \pi) = \{b \in B : \pi(a, b) > 0\}. \tag{11}
\]
Suppose we are able to quantify how different individuals are with some distance on the feature space $d : X \times X \to \mathbb{R}$. Then, using this set, we can define a notion of comparative individual bias.

Definition 3.2 (Individual Bias). The bias an individual $a$ experiences across policies is measured by the expectation:
\[
u_X(a) = \mathbb{E}_{b \sim \pi(a, \cdot)}[d(a, b)]. \tag{12}
\]
We can extend this definition of individual bias to subgroups and groups. Consider a partitioning of the input space $X$ into discrete groups $\mathcal{G} = (G_1, G_2, \ldots, G_n)$, where $G_i \subseteq X$ such that $X = \bigcup_i G_i$. We say that the bias experienced by some group is then a sum of the individual bias terms for members of that group.

Definition 3.3 (Group Bias). The bias (comparatively) experienced across policies for a group $G$ is measured by:
\[
U_X(G) = \sum_{g \in G} u_X(g). \tag{13}
\]
Remark 3.1. We also can compare groups directly rather than across the whole population. If all groups are disjoint, $G_i \cap G_j = \emptyset$ for all $i, j \leq |\mathcal{G}|$, then the group bias measurement can be additively decomposed as
\[
U_X(G_i) = \sum_{j \leq |\mathcal{G}|} U_{G_j}(G_i),
\]
where each $U_{G_j}(G_i)$ is the amount of bias measured in group $G_i$ that is the result of comparison with individuals in group $G_j$.

The success of the above framework relies on the existence of a metric $d : X \times X \to \mathbb{R}$ which can suitably compute differences between individuals. While taking the Euclidean distance over feature space is often possible, we leave the choice of distance metric as a hyperparameter in our problem formulation to allow for different measures of bias. We leverage this in our exploration of recourse in Section 4.3.
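Given the coupling over outcomes and a feature-space distance matrix, Definitions 3.2 and 3.3 reduce to a few lines; a sketch, in which we normalize the individual's row of the coupling into the conditional distribution $\pi(a_i, \cdot)$:

```python
import numpy as np

def individual_bias(i, pi, D):
    """u(a_i) = E_{b ~ pi(a_i, .)}[d(a_i, b)] (Definition 3.2).

    pi: optimal coupling over outcomes, shape (n0, n1)
    D:  feature-space distances, D[i, j] = d(a_i, b_j)"""
    row = pi[i]
    return (row @ D[i]) / row.sum()

def group_bias(members, pi, D):
    """U(G) = sum of individual biases over group members (Definition 3.3)."""
    return sum(individual_bias(i, pi, D) for i in members)
```

Swapping out `D` (for example, restricting it to a subset of features, as in Section 4.3) changes the notion of bias being measured without recomputing the coupling.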
4 Experimental Results

In this section, we demonstrate the flexibility of our optimal transport framework. First, we show that the Wasserstein distance detects bias when disparate impact fails. Then, we show that we can use the coupling matrix to recover known biases in the COMPAS dataset. Finally, we use the optimal coupling to explore recourse options for individuals, subgroups, and groups in the German credit dataset.
4.1 Detecting Bias When Disparate Impact Fails

To motivate the Wasserstein distance for fairness evaluation, consider the fictitious Blue College, who are looking to audit their admissions data for bias. Blue's applicants come from two secondary schools: expensive private School A and public School B. Blue College would like to check that their admissions policy is not biased with respect to an applicant's secondary school. Looking at previous admissions data, Blue College observes the following pattern:

1. For all students in School A, the probability of being accepted to Blue College is $P(Y = 1 \mid X = x) = P(Y = 1) = 0.25$.
2. For students in School B, $P(Y = 1 \mid X = x) = 1$ for the students with GPAs in the top 25% of GPAs in their class, and $P(Y = 1 \mid X = x) = 0$ for the bottom 75%.

Blue evaluates their decisions with the disparate impact criterion, which provides a measure of demographic parity.

Definition 4.1 (Disparate Impact (80% Rule)). A model is said to admit disparate impact if:
\[
\frac{P(Y = 1 \mid \text{School} = B)}{P(Y = 1 \mid \text{School} = A)} \leq \tau = 0.8. \tag{14}
\]
Blue College accepts 25% of the applicants from both schools, which implies that $\frac{P(Y = 1 \mid \text{School} = B)}{P(Y = 1 \mid \text{School} = A)} = 1$, and so the model does not admit disparate impact and is considered fair with respect to this fairness definition. More generally, if $P(Y = 1 \mid \text{School} = B) \leq P(Y = 1 \mid \text{School} = A)$, then $\tau = 1$ implies this is the fairest possible outcome. Clearly, this model is not fair, as it puts the bottom 75% of students from School B at a disadvantage, while also offering an advantage to the top 25% of students at School B.

We simulate several years of admissions according to this fictitious model with $A$ and $B$ as the students in the two schools, and $\mathcal{F}_0$ and $\mathcal{F}_1$ as the two described admissions patterns. In addition to the 25% rule described above, we also evaluate a 10% and 50% rule. A rule with a lower percentage indicates that fewer students from both schools are accepted by Blue College, and the 0% rule indicates that no student from A or B is accepted. In Figure 2, we plot the Wasserstein distance and disparate impact score as aggregated over the years.

Figure 2: Under disparate impact, all rules converge to a fair policy despite structural inequity in the admissions policy. In contrast, the Wasserstein distance recovers the inequities at all rule levels.

According to disparate impact, for any given year Blue College may appear fair or unfair, but the aggregate admissions data appears fair over time, despite obvious bias in the admissions process. Conversely, the Wasserstein distance reveals a fixed amount of bias over time and shows that the observed bias decreases as the rule tends towards zero. As the decision rule approaches the trivially fair scenario where all individuals from both schools are rejected, the Wasserstein distance decreases, indicating that the outcomes generated by the policies are moving closer together.
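A sketch of how a single year of this simulation might look under the 25% rule; the sample size and GPA distribution are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n = 400
gpa_B = rng.uniform(0.0, 4.0, n)  # placeholder GPA distribution for School B

rule = 0.25                                 # fraction accepted from each school
p_A = np.full(n, rule)                      # School A: uniform 25% chance
cutoff = np.quantile(gpa_B, 1 - rule)
p_B = (gpa_B >= cutoff).astype(float)       # School B: top 25% accepted for sure

# Disparate impact ratio (Definition 4.1): approximately 1, so "fair"
di = p_B.mean() / p_A.mean()

# Wasserstein distance between the two outcome distributions exposes the inequity
m = np.full(n, 1 / n)
C = ot.dist(p_A.reshape(-1, 1), p_B.reshape(-1, 1), metric='euclidean')
w = np.sum(ot.emd(m, m, C) * C)
print(f"disparate impact ratio: {di:.2f}, Wasserstein distance: {w:.3f}")
```

Both schools have the same acceptance rate, so the ratio is 1 and disparate impact sees no problem, while the Wasserstein distance between the all-or-nothing outcomes of School B and the uniform-chance outcomes of School A stays well above zero.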
4.2 Recovering Known Biases in COMPAS

We use the COMPAS Larson et al. [2016a] dataset to show that the optimal coupling can be used to detect bias in subgroups. To audit the COMPAS tool for racial bias, ProPublica Larson et al. [2016b] collected COMPAS scores, prior arrest records, and two-year re-arrest records in Broward County, Florida. In the accompanying report, they found that the COMPAS tool assigned Black defendants risk scores that were disproportionately higher than those assigned to their white counterparts, even after controlling for criminal history. As a result, Black defendants who did not recidivate were often misclassified as "high-risk," whereas for white defendants, the opposite was true: those who did get rearrested were often mistakenly assessed as "low-risk." To corroborate these results, we apply our optimal transport framework to the COMPAS dataset.

We take $A = B$ to be the set of individuals in the COMPAS dataset and compute the Wasserstein distance between the observed (two years after initial arrest) and predicted (by COMPAS) outcomes. For both the observed and predicted cases, we compute a logistic regression model, which takes as input relevant criminal history information and actual and predicted recidivism labels as targets. We use these logistic models to estimate the likelihood an individual recidivates under the two scenarios, denoted $\mu_{\mathcal{F}_0}^A$ and $\mu_{\mathcal{F}_1}^A$, respectively, where $\mathcal{F}_0$ denotes the true and $\mathcal{F}_1$ the predicted recidivism policies. By computing $W(\mu_{\mathcal{F}_0}^A, \mu_{\mathcal{F}_1}^A)$ we obtain the optimal coupling, which we then use to observe the differences between policies.

Figure 3: The transport map reveals how subgroups are mapped to one another from the ground truth outcomes to the predicted outcomes for A. COMPAS's predictions and B. an equal opportunity classifier's predictions. For example, a large fraction of high-risk white individuals achieve classification outcomes from COMPAS similar to low-risk Black defendants when compared to ground-truth recidivism. These differences are less stark in the outcomes produced by the equal opportunity classifier. C. The transport map recovers previously found biases in age and gender in the German credit dataset.

Our framework recovers the bias documented in the ProPublica study. First, we observe that as a result of the discriminatory policy, the group bias score for Black defendants who are considered high-risk is 5.88% higher than the Black defendants who were considered low-risk. For white defendants, however, we observe a 4.92% increase in bias in the opposite direction; the bias experienced by white individuals who are considered low-risk is higher than that of white individuals considered high-risk. This finding suggests that Black individuals who are considered high-risk and white individuals who are considered low-risk are more likely to attain similar predicted outcomes to an individual who is dissimilar to them in feature space, which is consistent with ProPublica's observations. Black individuals who are considered high-risk often have similar criminal histories to white individuals who were considered low-risk and vice versa. In Figure 3A we show the group-wise decomposition of bias: 43.8% of the outcomes attained by low-risk white individuals are similar to outcomes predicted for high-risk Black individuals, and 48.3% of high-risk Black individuals are mapped to low-risk white individuals.

Next, we train a third logistic regression model on the COMPAS labels. This regression differs from the earlier two by employing an equal opportunity constraint based on Zafar et al. [2017] to make the predicted outcomes more fair. The fair classifier increases the true positive rates at test time for both the Black and white groups by 10%. Furthermore, in Figure 3B, we can see that more of the subgroups are mapped to their appropriate counterparts. For the positive-predicted group, which we originally documented as discriminatory against mostly Black defendants, we see a decrease in unfairness from the original to the equal opportunity outcome distribution. Moreover, when we decompose the source of remaining bias, a larger share of individuals are mapped to defendants in their same predicted risk class. This observed decrease in unfairness suggests that the fair classifier produces outcome distributions that are less dependent on race than the original COMPAS predictions.
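Decompositions like those in Figure 3 can be produced by aggregating the coupling's mass over subgroup labels (for example, race crossed with predicted risk); a sketch, assuming the label arrays align with the rows and columns of the coupling:

```python
import numpy as np

def subgroup_mass(pi, row_labels, col_labels):
    """Share of each row subgroup's coupling mass mapped onto each column
    subgroup, in the style of the Figure 3 decompositions."""
    row_labels, col_labels = np.asarray(row_labels), np.asarray(col_labels)
    groups_r, groups_c = np.unique(row_labels), np.unique(col_labels)
    M = np.zeros((len(groups_r), len(groups_c)))
    for gi, g in enumerate(groups_r):
        mask_r = row_labels == g
        total = pi[mask_r].sum()  # all mass leaving subgroup g
        for hi, h in enumerate(groups_c):
            M[gi, hi] = pi[np.ix_(mask_r, col_labels == h)].sum() / total
    return groups_r, groups_c, M
```

Each row of `M` sums to 1 and reads as "where does this subgroup's outcome mass go," which is how statements like "48.3% of high-risk Black individuals are mapped to low-risk white individuals" can be computed.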
4.3 Exploring Recourse in the German Credit Dataset

To explore recourse opportunities with optimal transport, we use the German credit dataset Hofmann [2000], which contains loan applications at a bank and their classification as good or bad credit risks. It is commonly used for algorithmic recourse investigations Gupta et al. [2019], Karimi et al. [2020b], Chen et al. [2020], Rawal and Lakkaraju [2020]. We train a logistic regression classifier $\mathcal{F}$ on the features without the sensitive attributes (age and gender) to produce outcome probabilities for each individual and then compute the transport map between those who were classified as bad creditors, $A$, and those who were classified as good, $B$, under policy $\mathcal{F}$. Figure 3C shows that this map recovers age and gender discrimination that has previously been found, namely that the policy favors older and male applicants. In particular, 69.85% of men aged [18, 76) labeled as bad credit are disproportionately mapped to men aged [25, 76) labeled as good credit.

Algorithmic recourse asks what actionable changes an individual can take to change their classification Karimi et al. [2020a]. To understand recourse options for individuals, we break up the features into three types: actionable (and mutable), non-actionable (but mutable), and immutable (and non-actionable) (see Table A1 in the appendix for the full breakdown). For each type, we can compute the individual bias (Equation 12) while restricting the distance computation to features of that type. In Figure 5, we plot the distribution of the individual biases for each type of feature when split by age and sex for those in the bad credit class.

Figure 5: Individual bias in those labeled as bad creditors broken down by age and sex in the German credit dataset Hofmann [2000] when distances are taken with respect to A. actionable, B. non-actionable, and C. immutable features.

We see that men aged [25, 76) with a bad credit classification have the highest actionable bias, suggesting that this group's actionable features are the furthest from those with a good credit classification. Similarly, women aged [18, 25) with a bad credit classification have the lowest actionable bias even though they receive far fewer good credit ratings. Since the actionable features mainly measure attributes that a fair classifier should consider the most, credit history for example, this difference reinforces the claim that the policy discriminates along age and gender. For both non-actionable and immutable features, the differences in individual bias are stronger across sex than across age. This result suggests that this classification algorithm disadvantages women for attributes that they cannot change.

For each individual $a$ classified as having bad credit, we can compute $\Gamma(a, \pi)$, the group of people with good credit that they were mapped to under the optimal coupling $\pi$. The product $n_0 \sum_{b \in \Gamma(a, \pi)} \pi(a, b)\, b$ projects individual $a$ onto their counterparts in group $B$. Using this product, we can interpolate an individual's actionable features with those they were mapped to with weight $\alpha$ through
\[
(1 - \alpha)\, a_{\text{actionable}} + \alpha\, n_0 \sum_{b \in \Gamma(a, \pi)} \pi(a, b)\, b_{\text{actionable}}. \tag{15}
\]
We can then reapply policy $\mathcal{F}$ to an individual from the bad credit group with their original immutable and non-actionable features and interpolated actionable features to see how their classification changes in response to these actions.

Figure 4: As $\alpha$ increases, increasing the weight of the actionable features of those with good credit in the interpolation (Equation 15), the distribution of the probability of receiving a good credit label for those who had previously been labeled as bad credit shifts from being skewed towards low probabilities to a more uniform distribution.

Figure 4 shows how the probability of being classified as having good credit increases with $\alpha$, implying that making actionable changes to these features can affect outcomes. As $\alpha$ increases, the distributions shift from centering around low probabilities of receiving a good credit label to more uniform probabilities of receiving a good credit label, implying that changing these actionable features will improve an applicant's odds of success.

We can also examine changes in the actionable features of those who were reclassified to understand which features were the most important (see Figure A1 in the appendix). Previous studies tend to focus on recourse accuracy (how many people were successfully reclassified as a result of actions taken) rather than which features were deemed most important, so we cannot compare our results directly. Our results suggest that having a critical account or credit at other banks and not having a guarantor or co-applicant most positively affect classification, and having no credits or credit repaid duly most negatively affect classification. Having a savings account has a large impact on classification; no savings account or more than 100 DM in a savings account positively impacts classification, and having less than 100 DM negatively affects classification.
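A sketch of the interpolation in Equation 15; the classifier `clf` standing in for policy $\mathcal{F}$ is hypothetical, and the boolean mask of actionable columns follows the Table A1 breakdown:

```python
import numpy as np

def recourse_interpolation(a, B, pi_row, alpha, actionable):
    """Eq. 15: move an individual's actionable features a fraction alpha of the
    way toward the coupling-weighted average of their good-credit counterparts.

    a:          (d,) features of one bad-credit individual
    B:          (n1, d) features of the good-credit group
    pi_row:     (n1,) the individual's row of the optimal coupling
    actionable: (d,) boolean mask over actionable features (Table A1)"""
    weights = pi_row / pi_row.sum()  # normalized row, i.e. n0 * pi_row
    target = weights @ B             # projection of a onto group B
    a_new = np.array(a, dtype=float)
    a_new[actionable] = (1 - alpha) * a_new[actionable] + alpha * target[actionable]
    return a_new

# Re-scoring, e.g. clf.predict_proba(a_new[None, :])[0, 1] with a hypothetical
# trained classifier clf, shows how the good-credit probability shifts with alpha
# (Figure 4); immutable and non-actionable features are left untouched.
```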
5 Related Work

Our fairness philosophy builds on work by Kearns et al. [2018] and Friedler et al. [2016]. Kearns et al. [2018] seek to reduce the divide between group and individual fairness by considering statistical notions defined over a large number of subgroups. We also embrace a view of fairness beyond protected group attributes; however, our work differs in that it provides an interpretable framework to study bias as well as allowing notions of fairness to be flexibly instantiated. We also draw inspiration from Friedler et al. [2016], who argue that algorithmic fairness must consider the relationship between outcome and feature spaces. This philosophy inspired our analysis of fairness in the outcome space rather than in the feature space.

Individual fairness Dwork et al. [2012] is often criticized due to its reliance on the existence of similarity metrics on features and outcomes Chouldechova and Roth [2018]. While our definition of individual bias also relies on a distance between features, we are able to examine bias directly through the coupling matrix or vary the distance as a hyperparameter to fully explore possible relations in feature space. Thus, in our case, these distance parameters provide meaningful flexibility and insight.

Several other works Silvia et al. [2020], Zehlike et al. [2020], Gordaliza et al. [2019], Jiang et al. [2019], Feldman et al. [2015], Johndrow et al. [2019], Chiappa and Pacchiano [2021] use optimal transport in the context of algorithmic fairness to compute a bias-neutral feature representation for downstream classification tasks using Wasserstein barycenter representations. Our work is not concerned with computing fair classifiers, but rather with studying and auditing for sources of bias, which most closely relates to Black et al. [2020]. Through FlipTest, Black et al. [2020] use optimal transport to match individuals in different sensitive attribute groups and then test how changing an individual's group membership affects their classification. FlipTest uses the Monge formulation of optimal transport, which finds a permutation between the groups, rather than our more flexible Kantorovich formulation, which can split an individual's mass in the transport map. Additionally, their transport map is based on distributions on features, while ours is based on distributions on outcomes. The authors also use the set of individuals whose classification flipped to study which features were most important in the model. This approach is somewhat similar to our recourse experiments, but it audits the effects of all of the features, not only the actionable ones. Similarly, both Taskesen et al. [2020] and Xue et al. [2020] use optimal transport to measure changes in classification under perturbation to input features.
6 Conclusion

We have presented an optimal transport-based framework for auditing automated decision making that allows us to explore bias in individuals, subgroups, and groups. The flexibility in our framework allows us to test and compare multiple notions of fairness simultaneously; we can use the Wasserstein distance to quantify bias or leverage the coupling matrix to study the structure of bias. Our notion of comparative fairness means that we can decide what frame of reference to use to examine bias. Through the COMPAS example, we have demonstrated that optimal transport can investigate the impact of two different policies (predicted recidivism and ground truth recidivism) on the same population as well as the subgroups within that population. Furthermore, the transport map elucidates structures of bias and allows us to quantify it. In studying the German credit dataset, we have shown that optimal transport can audit automated decisions after they are made as well as offer potential recourse opportunities.

One of the limitations of the standard optimal transport framework is that all of the mass from one distribution must be transported onto all of the mass of the other distribution. In the context of fairness, this limitation means that outliers and unevenly represented subgroups may not be mapped to their appropriate counterparts. To address this, we could modify our framework to use unbalanced or partial optimal transport. Additionally, computing transport maps between large datasets can be computationally intensive, so we could use entropically regularized optimal transport to audit larger datasets. Future work will focus on addressing these limitations as well as formalizing the mathematical theory and its connections to previous definitions of fairness.
References

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

E. Black, S. Yeom, and M. Fredrikson. FlipTest: fairness testing via optimal transport. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 111–121, 2020.

N. Bonneel, M. Van De Panne, S. Paris, and W. Heidrich. Displacement interpolation using Lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, pages 1–12, 2011.

T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18, 2009. doi: 10.1109/ICDMW.2009.83.

Y. Chen, J. Wang, and Y. Liu. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020.

S. Chiappa and A. Pacchiano. Fairness with continuous optimal transport, 2021.

A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, 2016.

A. Chouldechova and A. Roth. The frontiers of fairness in machine learning. CoRR, abs/1810.08810, 2018. URL http://arxiv.org/abs/1810.08810.

S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning, 2018.

K. W. Crenshaw. On intersectionality: Essential writings. The New Press, 2017.

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact, 2015.

S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian. On the (im)possibility of fairness, 2016.

P. Gordaliza, E. Del Barrio, G. Fabrice, and J.-M. Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365, 2019.

V. Gupta, P. Nokhiz, C. D. Roy, and S. Venkatasubramanian. Equalizing recourse across groups. arXiv preprint arXiv:1909.03166, 2019.

A. Hanna, E. Denton, A. Smart, and J. Smith-Loud. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 501–512, 2020.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. CoRR, abs/1610.02413, 2016. URL http://arxiv.org/abs/1610.02413.

A. L. Hoffmann. Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society, 22(7):900–915, 2019. doi: 10.1080/1369118X.2019.1573912. URL https://doi.org/10.1080/1369118X.2019.1573912.

H. Hofmann. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data), 2000.

L. Hu and Y. Chen. Fair classification and social welfare. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 535–545, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi: 10.1145/3351095.3372857. URL https://doi.org/10.1145/3351095.3372857.

R. Jiang, A. Pacchiano, T. Stepleton, H. Jiang, and S. Chiappa. Wasserstein fair classification, 2019.

J. E. Johndrow, K. Lum, et al. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1):189–220, 2019.

A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050, 2020a.

A.-H. Karimi, B. Schölkopf, and I. Valera. Algorithmic recourse: from counterfactual explanations to interventions. arXiv preprint arXiv:2002.06278, 2020b.

M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564–2572, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kearns18a.html.

J. Larson, S. Mattu, L. Kirchner, and J. Angwin. https://github.com/propublica/compas-analysis, 2016a.

J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica (5 2016), 9(1), 2016b.

M. Lohaus, M. Perrot, and U. V. Luxburg. Too relaxed to be fair. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6360–6369. PMLR, 13–18 Jul 2020.

K. Lum and W. Isaac. To predict and serve? Significance, 13(5):14–19, 2016. doi: https://doi.org/10.1111/j.1740-9713.2016.00960.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1740-9713.2016.00960.x.

L. Munkhdalai, T. Munkhdalai, O.-E. Namsrai, J. Y. Lee, and K. H. Ryu. An empirical comparison of machine-learning methods on bank client credit assessments. Sustainability, 11(3), 2019. ISSN 2071-1050. doi: 10.3390/su11030699.

G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

K. Rawal and H. Lakkaraju. Interpretable and interactive summaries of actionable recourses. arXiv preprint arXiv:2009.07165, 2020.

A. D. Selbst, D. Boyd, S. A. Friedler, S. Venkatasubramanian, and J. Vertesi. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 59–68, 2019.

C. Silvia, J. Ray, S. Tom, P. Aldo, J. Heinrich, and A. John. A general approach to fairness with optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3633–3640, 2020.

B. Taskesen, J. Blanchet, D. Kuhn, and V. A. Nguyen. A statistical test for probabilistic fairness, 2020.

B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017.

Y. Wu, L. Zhang, and X. Wu. On convexity and bounds of fairness-aware classification. In The World Wide Web Conference, WWW '19, pages 3356–3362, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450366748. doi: 10.1145/3308558.3313723. URL https://doi.org/10.1145/3308558.3313723.

S. Xue, M. Yurochkin, and Y. Sun. Auditing ML models for individual bias and unfairness, 2020.

M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification, 2017.

M. Zehlike, P. Hacker, and E. Wiedemann. Matching code and law: achieving algorithmic fairness with optimal transport. Data Mining and Knowledge Discovery, 34(1):163–200, 2020.

Appendix
Computing Individual and Group Bias
Our definitions of individual and group bias assume that we can measure the entire population $X$. In practice, we have data $A = \{a_i\}_{i=1}^{n_0} \sim \mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1} \sim \mathcal{B}$, which requires slight changes for computations.

Definition 0.1 (Individual Bias). The bias an individual $a$ from group $A$ experiences in comparison to group $B$ across policies is measured by the expectation:
\[
u_B(a) = \sum_{b \in \Gamma(a, \pi)} d(a, b)\, \pi(a, b). \tag{16}
\]
We need to alter the definition of group bias to account for the fact that we can only compare an individual $a \in G \subset A$ to their counterparts in $B$.

Definition 0.2 (Group Bias). The bias (comparatively) experienced across policies for a group $G$ within population $A$ when compared to population $B$ is measured by:
\[
U_B(A \cap G) = \sum_{g \in A \cap G} u_B(g). \tag{17}
\]
Finally, if we decompose the datasets $A$ and $B$ into disjoint groups $A = \{A_1, A_2, \ldots, A_n\}$ and $B = \{B_1, B_2, \ldots, B_m\}$, then the group bias measurement in population $A$ relative to population $B$ can be additively decomposed as
\[
U_B(A_i) = \sum_{j \leq m} U_{B_j}(A_i),
\]
where $U_{B_j}(A_i)$ is the amount of bias measured in group $A_i$ within population $A$ that is the result of comparison with individuals in group $B_j$ within population $B$. We make use of this framework to compare how individuals in the COMPAS dataset are mapped across race and criminal history-based groups in Section 4.2 and how individuals in the German credit data are mapped across age and sex-based groups in Section 4.3.

In the case that an individual $a$ in group $A$ is very different from everyone in group $B$, the individual bias in Definition 3.2 will be high regardless of who $a$ is mapped to. Thus, to detect such outliers, we can normalize the individual bias by the individual's own worst-case scenario, i.e., when they have all of their mass mapped to the individual who is farthest from them:
\[
u^*_B(a) = \frac{n_0 \sum_{j=1}^{n_1} \pi(a, b_j)\, d(a, b_j)}{\max_{j=1:n_1} d(a, b_j)}. \tag{18}
\]
This new normalized individual bias can then be used to differentiate between cases when an individual experiences extreme bias or merely has no equivalent comparison in feature space.
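A sketch of the normalized measure in Equation 18, assuming the same coupling matrix and feature-distance matrix as in the earlier bias computations:

```python
import numpy as np

def normalized_individual_bias(i, pi, D):
    """Eq. 18: individual bias scaled by the worst case in which all of the
    individual's mass is mapped to their farthest counterpart.

    pi: optimal coupling, shape (n0, n1); D[i, j] = d(a_i, b_j)."""
    n0 = pi.shape[0]
    return (n0 * pi[i] @ D[i]) / D[i].max()
```

Values near 1 indicate that the individual is mapped about as badly as possible, which distinguishes genuine outliers from individuals who simply experience extreme bias.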
Variable                      Group
Savings                       Actionable
Credit history                Actionable
Other installments            Actionable
Co-applicant or guarantor     Actionable
Number liable individuals     Actionable
Unemployed                    Actionable
Property owned                Actionable
Number existing credits       Actionable
Installment rate              Actionable
Credit amount                 Non-actionable
Credit duration               Non-actionable
Credit purpose                Non-actionable
Telephone                     Non-actionable
Sex                           Immutable
Age                           Immutable
Foreign national              Immutable
Years at job                  Immutable
Skilled employee              Immutable
Permanent resident since      Immutable

Table A1: Actionable, non-actionable, and immutable features in the German credit dataset.

COMPAS dataset

COMPAS is a commercial tool used to predict recidivism risk of individuals awaiting trial. The COMPAS dataset contains data compiled by ProPublica Larson et al. [2016b] on 6172 people who were scored by the COMPAS algorithm in Broward County, Florida from 2013-2014. Each person has a predicted recidivism score as well as a binary flag to indicate whether or not they recidivated. There are 3,175 Black defendants, 2,103 white defendants, and 2,809 defendants who recidivated within two years in this sample. It contains 9 features: two year recidivism, number of priors, age above 45, age below 25, female, misdemeanor, ethnicity, and predicted recidivism probability. We one-hot encode the categorical variables and normalize the features prior to fitting our models.
German credit dataset
The German credit dataset Hofmann [2000] contains 1000 instances, with 26 features each, of loan applications at a bank. It includes applicant profiles as well as their classification as a bad ($n = 300$) or good ($n = 700$) credit risk. We one-hot encode categorical variables and then z-score normalize all features before fitting our logistic regression model. Table A1 describes how we classified features as actionable, non-actionable, and immutable, as inspired by Chen et al. [2020].

For each individual labeled as having bad credit, we interpolate their actionable features with the actionable features of those they were mapped to (Equation 15). Then, we reclassify these individuals. In Figure A1, we present the percentage of individuals who were reclassified as having good credit and record the average change for each actionable feature for varying values of $\alpha$.

Figure A1: Average changes in actionable features in the German credit dataset after individuals with bad credit were interpolated (at level $\alpha$).