Everything is Relative: Understanding Fairness with Optimal Transport
Kweku Kwegyir-Aggrey¹, Rebecca Santorella², and Sarah M. Brown³
¹Department of Computer Science, Brown University
²Division of Applied Mathematics, Brown University
³Department of Computer Science, University of Rhode Island
Abstract
To study discrimination in automated decision-making systems, scholars have proposed several definitions of fairness, each expressing a different fair ideal. These definitions require practitioners to make complex decisions regarding which notion to employ and are often difficult to use in practice since they make a binary judgement (a system is fair or unfair) instead of explaining the structure of the detected unfairness. We present an optimal transport-based approach to fairness that offers an interpretable and quantifiable exploration of bias and its structure by comparing a pair of outcomes to one another. In this work, we use the optimal transport map to examine individual, subgroup, and group fairness. Our framework is able to recover well-known examples of algorithmic discrimination, detect unfairness when other metrics fail, and explore recourse opportunities.
1 Introduction

Machine learning-based decisions in sensitive domains such as criminal justice Lum and Isaac [2016] or banking Munkhdalai et al. [2019] have come under significant scrutiny due to their ability to amplify, re-invent, and create patterns of discriminatory behavior against already marginalized and vulnerable groups. Many scholars have attempted to prevent algorithmic discrimination by engineering fair algorithmic solutions. These efforts Zafar et al. [2017], Hardt et al. [2016], Woodworth et al. [2017] rely on the assumption that if it can be proven or demonstrated that a classifier exhibits fair behavior with respect to some agreeable notion of fairness, this classifier will no longer produce discriminatory outcomes. In contrast, critical scholars instead assert that discrimination by algorithmic systems is a symptom of broader systematic and historic forms of oppression Selbst et al. [2019], which cannot be ameliorated through fair algorithmic solutionism. We suggest that the key tension between these viewpoints relies on resolving the following question:
If machine learning can be used to examine the discrimination present in algorithmic solutions, how can bias be meaningfully quantified and studied within this context to prevent further discriminatory outcomes? In this framing, we view bias as a measurement of discrimination relative to some fair ideal.

Despite countless examples of bias in automated decision making, a precise definition of fairness in this context remains elusive. Philosophical and legal doctrine have proposed several conceptualizations of fairness such as disparate impact and equal opportunity, but no single fair notion has become standard or ubiquitous. Instead, each fairness notion exists to describe idealized fair behavior in a different context or application. The machine learning community approaches fairness similarly, usually by translating some abstract notion into a mathematical expression and then declaring classifiers as fair if they satisfy said expression. We consider two types of fair criteria: individual fairness Dwork et al. [2012] and group fairness. The first postulates that similar individuals should be treated similarly, regardless of protected group membership such as race or gender, while the second considers the context of these protected attributes and looks to achieve some notion of parity between groups. For example, the group fairness notion of demographic parity suggests that fairness is achieved when a classifier's predictions are independent of individuals' protected attributes.

Although these fair definitions are tractable and succinct, they often lack sufficient nuance to examine fairness in the context of real sociotechnical systems Hoffmann [2019]. For example, it has been shown that binary assumptions about protected group membership limit the applicability of fair algorithms Hanna et al. [2020]. While these assumptions make some fairness problems tractable, Kearns et al. [2018] demonstrated that they are not sufficient to compute fair outcomes over more complex groups such as a set of protected groups that may intersect (our intersectional understanding of identity is based in the notion of Intersectionality Theory Crenshaw [2017]). Moreover, these fairness definitions can suggest that a system is fair or unfair but are unable to provide actionable recourse as to how to amend the bias, further rendering these definitions deficient in practice Corbett-Davies and Goel [2018].

Our solution is to evoke an understanding of fairness that does not rely on closed-form definitions. Instead, we propose that a set of outcomes can only be deemed unfair if it deviates strongly from a set of outcomes that are deemed fair. In other words, fairness is a property that can be evaluated by comparing sets of outcomes. This intuition can be made tractable by viewing sets of outcomes as probability distributions and then leveraging some statistical difference measure to quantify the differences between outcomes.

Example 1.1. A university admissions process may be considered unfair if it admits cohorts that are not sufficiently representative or diverse. This complaint follows from comparing the admitted cohort to another, possibly hypothetical, fair cohort. This notion of fairness does not necessarily require parity in classification (not all student groups must be admitted with the same frequency), which differentiates this intuition from current fair algorithmic practices.

We advocate for the Wasserstein distance as the best statistical distance metric to compare outcome distributions. The Wasserstein distance has many advantages over other distribution metrics such as KL-Divergence or Total Variation distance (we refer the reader to Arjovsky et al. [2017] for a summary of these advantages); however, there are two advantages we would like to highlight. First, the Wasserstein distance can be computed between distributions that are not defined over the same supports. This flexibility means that we can compare unfairness across two seemingly disparate groups of observations. Additionally, computing the Wasserstein distance produces an optimal transport map, which describes how one outcome distribution is transformed into the other and so reveals the structure of any disparity.
To circumvent the limitations of the definition-based fairness approach, we provide an optimal transport-based framework for exploring and quantifying bias in automated systems. We compute the Wasserstein distance between outcomes for different groups to understand how one set of outcomes may be transformed into another. This distance is able to detect bias when other fairness metrics fail, and we can leverage the transport map used in the distance calculation to quantify individual, subgroup, and group bias. This approach allows us to recover biases previously found in the COMPAS and German credit datasets, as well as offer possible actions for recourse.
2 Background

In this section, we introduce key definitions in fairness and relevant mathematical background on optimal transport.
Allow $X \subseteq \mathbb{R}^k$ to be some set of features and $Y = \{0, 1\}$ to be a binary set of possible outcomes for some decision. We assume that $S = \{0, 1\}$ describes the protected group membership of an individual. A classifier is a function $h : X \to \{0, 1\}$ which outputs a classification decision $h(x)$ for each individual $x \in X$. We assume there exists a distribution $\mathcal{D}$ over the set $Z = X \times S \times Y$, which takes on values $(x, s, y) \sim \mathcal{D}$. Below we review common definitions of group fairness that we will use as points of comparison.

2.1 Group Fairness
Definition 2.1 (Demographic Parity). A classifier is fair under demographic parity if the classifier's predictions are independent from the sensitive attribute Calders et al. [2009]:
\[
\mathbb{E}_{\mathcal{D}}[h(X) \mid S = 1] = \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 0]. \tag{1}
\]
Chouldechova [2016] demonstrated that satisfying demographic parity requires each protected group to have the same distribution of features. Most data distributions violate this condition, and in that case, enforcing demographic parity can have an adverse effect on classifier performance Zafar et al. [2017], Lohaus et al. [2020], Hu and Chen [2020]. To avoid this performance loss, the difference of demographic parity (DDP) Wu et al. [2019] relaxes the strict independence assumption to
\[
\mathrm{DDP}(h) := \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 1] - \mathbb{E}_{\mathcal{D}}[h(X) \mid S = 0],
\]
but this relaxation still reduces fairness to a single summary number and does not provide additional information regarding the structure of observed unfairness. (In Section 4.1 we discuss disparate impact, which quantifies a lack of demographic parity.)

Equal opportunity Hardt et al. [2016] achieves fairness by equalizing the per-group true positive rates of a classifier.

Definition 2.2 (Equal Opportunity). A classifier is fair under equal opportunity if its predictions have the same true positive rates across sensitive attribute membership:
\[
\mathbb{E}_{\mathcal{D}}[h(X) \mid Y = 1, S = 1] = \mathbb{E}_{\mathcal{D}}[h(X) \mid Y = 1, S = 0].
\]
Although equal opportunity differs from demographic parity in that it can be enforced despite differences in group-conditioned feature distributions, the two expressions share many of the same limitations, such as the inability to examine the structure of unfairness. We refer the reader to Corbett-Davies and Goel [2018] for a more complete analysis of parity-based fairness definitions.
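As a concrete illustration (not from the paper), both parity gaps can be estimated directly from model predictions; the function names and data layout below are our own, given as a minimal sketch:

```python
import numpy as np

def ddp(y_pred, s):
    """Difference of demographic parity: E[h(X) | S=1] - E[h(X) | S=0]."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return y_pred[s == 1].mean() - y_pred[s == 0].mean()

def equal_opportunity_gap(y_pred, y_true, s):
    """Gap in true positive rates across groups (Definition 2.2):
    restrict to Y = 1 and compare mean predictions per group."""
    pos = np.asarray(y_true) == 1
    y_pred, s = np.asarray(y_pred)[pos], np.asarray(s)[pos]
    return y_pred[s == 1].mean() - y_pred[s == 0].mean()
```

Both functions return 0 for a classifier that exactly satisfies the corresponding definition; note that either gap is a single number with no information about where the disparity comes from.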
2.2 Individual Fairness

Individual fairness Dwork et al. [2012] is summarized by the intuition that "similar individuals should be treated similarly." Although this definition is semantically appealing, choosing a reasonable similarity metric is often task-specific, thus rendering this notion of fairness untenable across domains Chouldechova and Roth [2018]. Furthermore, there are many instances, such as Example 1.1, where fairness may require treating dissimilar individuals similarly, which is not captured by this definition.
2.3 Optimal Transport

To measure the difference between sets of outcomes, we use optimal transport, which seeks the most cost-effective way to transform one probability distribution into another Peyré et al. [2019]. If we think of distributions as probabilities on point masses, then optimal transport finds the minimal-effort way to move masses from one domain onto masses in another domain. The Kantorovich formulation allows the transport map to split masses between domains, meaning that the map does not need to be a bijection.

Formally, given two measures $\mu_0$ and $\mu_1$ defined on the measure spaces $\mathcal{A}$ and $\mathcal{B}$, respectively, the optimal transport problem seeks an optimal coupling $\pi$ from the set
\[
\Pi(\mu_0, \mu_1) = \{\pi \in \mathcal{P}(\mathcal{A} \times \mathcal{B}) : \pi(A \times \mathcal{B}) = \mu_0(A) \text{ for } A \subset \mathcal{A},\ \pi(\mathcal{A} \times B) = \mu_1(B) \text{ for } B \subset \mathcal{B}\}. \tag{2}
\]
The coupling $\pi$ is a new probability distribution over the product space $\mathcal{A} \times \mathcal{B}$ that assigns probabilities to relate masses from domain $\mathcal{A}$ to masses in domain $\mathcal{B}$. The constraints ensure that the marginals of $\pi$ match the original input distributions $\mu_0$ and $\mu_1$. The full optimal transport problem with constraints is given by
\[
\min_{\pi \in \Pi(\mu_0, \mu_1)} \int_{\mathcal{A} \times \mathcal{B}} c(a, b)\, d\pi(a, b), \tag{3}
\]
where $c : \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ is a cost function that measures how difficult it is to move mass from domain $\mathcal{A}$ to $\mathcal{B}$.

The $p$-th Wasserstein distance is a special case of this optimal transport problem on the metric measure space $(\mathcal{A}, d)$ with set $\mathcal{A}$ and distance $d : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ when the cost function is $c(a_0, a_1) = d(a_0, a_1)^p$:
\[
W_p(\mu_0, \mu_1) \equiv \left( \inf_{\pi \in \Pi(\mu_0, \mu_1)} \int_{\mathcal{A} \times \mathcal{A}} d(a_0, a_1)^p\, d\pi(a_0, a_1) \right)^{1/p}. \tag{4}
\]
The Wasserstein distance is commonly used in machine learning to measure distances between probability distributions.

Given data $A = \{a_i\}_{i=1}^{n_0}$ from $\mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1}$ from $\mathcal{B}$, the marginal distributions can be expressed as
\[
m_0 = \sum_{i=1}^{n_0} p_i \delta_{a_i} \quad \text{and} \quad m_1 = \sum_{j=1}^{n_1} q_j \delta_{b_j}, \tag{5}
\]
where $\delta_a$ is the Dirac measure, and the set of admissible couplings are the matrices
\[
\Pi(m_0, m_1) = \{\pi \in \mathbb{R}_+^{n_0 \times n_1} : \pi \mathbf{1}_{n_1} = m_0,\ \pi^T \mathbf{1}_{n_0} = m_1\}. \tag{6}
\]
Intuitively, each entry $\pi(a_i, b_j)$ in a coupling matrix says how much mass from point $a_i$ in $A$ should be mapped onto point $b_j$ in $B$. Similarly, the cost can be described as a matrix $C \in \mathbb{R}^{n_0 \times n_1}$, where $C_{ij} = c(a_i, b_j)$ captures the cost of moving point $a_i$ onto point $b_j$. Using this setup, the discrete optimal transport problem is
\[
\min_{\pi \in \Pi(m_0, m_1)} \langle \pi, C \rangle, \tag{7}
\]
which can be solved with minimum cost flow solvers Bonneel et al. [2011].
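The discrete problem in Equation 7 can be solved with off-the-shelf tools. The sketch below uses the Python Optimal Transport (POT) toolbox, which the paper references in Section 3; the synthetic data here are placeholders for illustration only:

```python
import numpy as np
import ot  # Python Optimal Transport (https://pot.readthedocs.io/en/stable/)

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 3))   # samples {a_i} from domain A
B = rng.normal(0.5, 1.0, size=(60, 3))   # samples {b_j} from domain B

# Uniform empirical marginals m_0 and m_1 (Eq. 5 with p_i = 1/n_0, q_j = 1/n_1)
m0 = np.full(A.shape[0], 1 / A.shape[0])
m1 = np.full(B.shape[0], 1 / B.shape[0])

# Cost matrix C_ij = ||a_i - b_j|| (Euclidean ground distance)
C = ot.dist(A, B, metric='euclidean')

# Solve min_{pi in Pi(m0, m1)} <pi, C> (Eq. 7) via a network-flow solver
pi = ot.emd(m0, m1, C)   # optimal coupling, shape (n_0, n_1)
w1 = np.sum(pi * C)      # 1-Wasserstein distance under this cost
```

The coupling `pi` is exactly the matrix constrained by Equation 6: its rows sum to `m0` and its columns sum to `m1`.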
3 Our Framework

In this section, we introduce our optimal transport-based framework and explain how we quantify bias in individuals, subgroups, and groups.

3.1 Comparing Policies

We are interested in comparing the differences between two decision-making policies with respect to a given set of outcomes, where outcomes are probability vectors over possible predicted labels. To capture more nuance in algorithmic decisions, we extend the concept of a classifier to a policy, which expresses outcomes (or predictions) as probabilities rather than binary labels.

Figure 1: Schematic of optimal transport applied to the school admissions example in Section 4.1. Policies are applied to the two populations to generate a transport map between their outcomes.
Definition 3.1 (Policy). A policy $\mathcal{F} : X \to \mathcal{O}(Y)$ is a mapping from the feature space to the space of outcomes.

Denote the distribution on outcomes generated by observing a policy $\mathcal{F}$ on the space $X$ by $\mu_{\mathcal{F}}^X$. Given policies $\mathcal{F}_0$ and $\mathcal{F}_1$ on subsets $A$ and $B$ of $X$, we use the Wasserstein distance to measure the differences in outcomes they produce:
\[
W(\mu_{\mathcal{F}_0}^A, \mu_{\mathcal{F}_1}^B). \tag{8}
\]
This parametrization provides control over what comparison the distance captures. By setting $A$ and $B$ to be the same population, we measure the impact of different policies. Alternatively, we can take $\mathcal{F}_0$ and $\mathcal{F}_1$ to be the same policy to study its effects on different populations.

In practice, we do not have access to the true distributions of the outcomes, so we approximate the outcome distributions with uniform distributions, as is standard in computational optimal transport Peyré et al. [2019], and compute $W(m_{\mathcal{F}_0}^A, m_{\mathcal{F}_1}^B)$:
\[
m_{\mathcal{F}_0}^A = \frac{1}{n_0} \sum_{i=1}^{n_0} \delta_{\mathcal{F}_0(a_i)} \quad \text{and} \quad m_{\mathcal{F}_1}^B = \frac{1}{n_1} \sum_{j=1}^{n_1} \delta_{\mathcal{F}_1(b_j)}. \tag{9}
\]
In the above empirical distributions, the outcomes are generated by applying the policies to data $A = \{a_i\}_{i=1}^{n_0} \sim \mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1} \sim \mathcal{B}$. In this case, the cost matrix is computed by
\[
C_{ij} = \|\mathcal{F}_0(a_i) - \mathcal{F}_1(b_j)\|. \tag{10}
\]
Figure 1 gives an overview of this process.

This computation reveals the magnitude of difference between the outcomes from policies $\mathcal{F}_0$ and $\mathcal{F}_1$. Allow $\pi$ to be the optimal coupling associated with the above Wasserstein distance. This coupling is a transport plan describing how to split the mass of outcomes from $\mathcal{F}_0$ onto outcomes from $\mathcal{F}_1$ to produce similar behavior (or vice-versa). In particular, the entries of the normalized row $n_0 \pi_i$ describe how to split the mass of outcome $\mathcal{F}_0(a_i)$ onto the outcomes $\{\mathcal{F}_1(b_j)\}_{j=1}^{n_1}$. Since the outcome vectors result from applying policies to individuals in the datasets, we can also trace back the policy mappings and interpret the transport map between individuals in the two sets. We implement this method with the Python Optimal Transport toolbox (https://pot.readthedocs.io/en/stable/).
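A minimal sketch of this step, under the assumption that each policy is represented by its per-individual outcome probabilities (for instance, the output of a scikit-learn model's `predict_proba`); the helper name is ours:

```python
import numpy as np
import ot

def outcome_coupling(F0_probs, F1_probs):
    """Couple two policies' empirical outcome distributions (Eqs. 8-10).

    F0_probs, F1_probs: per-individual outcome probabilities, one row per
    individual (a 1-D array of scores also works)."""
    F0 = np.asarray(F0_probs, dtype=float).reshape(len(F0_probs), -1)
    F1 = np.asarray(F1_probs, dtype=float).reshape(len(F1_probs), -1)
    n0, n1 = len(F0), len(F1)
    m0, m1 = np.full(n0, 1 / n0), np.full(n1, 1 / n1)  # uniform marginals (Eq. 9)
    C = ot.dist(F0, F1, metric='euclidean')  # C_ij = ||F0(a_i) - F1(b_j)|| (Eq. 10)
    pi = ot.emd(m0, m1, C)                   # optimal coupling over outcomes
    return pi, np.sum(pi * C)                # coupling and Wasserstein distance

# Row i of n0 * pi describes how outcome F0(a_i) splits across {F1(b_j)}.
```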
3.2 Measuring Individual and Group Bias

While the Wasserstein distance measures how different two sets of outcomes are, it does not take into account differences between the underlying populations to which each policy is applied. Thus, to compare two policies and examine their biases, we need to introduce metrics which take the feature space into account. Each outcome $\mathcal{F}(a)$ corresponds to some individual $a$, so we can think of the coupling as mapping individuals in $A$ to individuals in $B$ through their outcomes.

Allow $\Gamma(a, \pi)$ to be the set of all individuals that $a \in A$ is mapped to under the coupling $\pi$:
\[
\Gamma(a, \pi) = \{b \in B : \pi(a, b) > 0\}. \tag{11}
\]
Suppose we are able to quantify how different individuals are with some distance on the feature space $d : X \times X \to \mathbb{R}$. Then, using this set, we can define a notion of comparative individual bias.

Definition 3.2 (Individual Bias). The bias an individual $a$ experiences across policies is measured by the expectation:
\[
u_X(a) = \mathbb{E}_{b \sim \pi(a, \cdot)}[d(a, b)]. \tag{12}
\]
We can extend this definition of individual bias to subgroups and groups. Consider a partitioning of the input space $X$ into discrete groups $\mathcal{G} = (G_1, G_2, \ldots, G_n)$, where $G_i \subseteq X$ such that $X = \bigcup_i G_i$. We say that the bias experienced by some group is then a sum of the individual bias terms for members of that group.

Definition 3.3 (Group Bias). The bias (comparatively) experienced across policies for a group $G$ is measured by:
\[
U_X(G) = \sum_{g \in G} u_X(g). \tag{13}
\]
Remark 3.1. We also can compare groups directly rather than across the whole population. If all groups are disjoint, $G_i \cap G_j = \emptyset$ for all $i, j \leq |\mathcal{G}|$, then the group bias measurement can be additively decomposed as
\[
U_X(G_i) = \sum_{j \leq |\mathcal{G}|} U_{G_j}(G_i),
\]
where each $U_{G_j}(G_i)$ is the amount of bias measured in group $G_i$ that is the result of comparison with individuals in group $G_j$.

The success of the above framework relies on the existence of a metric $d : X \times X \to \mathbb{R}$ which can suitably compute differences between individuals. While taking the Euclidean distance over feature space is often possible, we leave the choice of distance metric as a hyperparameter in our problem formulation to allow for different measures of bias. We leverage this in our exploration of recourse in Section 4.3.
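Given the coupling over outcomes and a feature-space distance matrix, Definitions 3.2 and 3.3 reduce to a few lines; a sketch, in which we normalize the individual's row of the coupling into the conditional distribution $\pi(a_i, \cdot)$:

```python
import numpy as np

def individual_bias(i, pi, D):
    """u(a_i) = E_{b ~ pi(a_i, .)}[d(a_i, b)] (Definition 3.2).

    pi: optimal coupling over outcomes, shape (n0, n1)
    D:  feature-space distances, D[i, j] = d(a_i, b_j)"""
    row = pi[i]
    return (row @ D[i]) / row.sum()

def group_bias(members, pi, D):
    """U(G) = sum of individual biases over group members (Definition 3.3)."""
    return sum(individual_bias(i, pi, D) for i in members)
```

Swapping out `D` (for example, restricting it to a subset of features, as in Section 4.3) changes the notion of bias being measured without recomputing the coupling.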
4 Experimental Results

In this section, we demonstrate the flexibility of our optimal transport framework. First, we show that the Wasserstein distance detects bias when disparate impact fails. Then, we show that we can use the coupling matrix to recover known biases in the COMPAS dataset. Finally, we use the optimal coupling to explore recourse options for individuals, subgroups, and groups in the German credit dataset.
4.1 Detecting Bias When Disparate Impact Fails

To motivate the Wasserstein distance for fairness evaluation, consider the fictitious Blue College, who are looking to audit their admissions data for bias. Blue's applicants come from two secondary schools: expensive private School A and public School B. Blue College would like to check that their admissions policy is not biased with respect to an applicant's secondary school. Looking at previous admissions data, Blue College observes the following pattern:

1. For all students in School A, the probability of being accepted to Blue College is $P(Y = 1 \mid X = x) = P(Y = 1) = 0.25$.
2. For students in School B, $P(Y = 1 \mid X = x) = 1$ for the students with GPAs in the top 25% of GPAs in their class, and $P(Y = 1 \mid X = x) = 0$ for the bottom 75%.

Blue evaluates their decisions with the disparate impact criterion, which provides a measure of demographic parity.

Definition 4.1 (Disparate Impact (80% Rule)). A model is said to admit disparate impact if:
\[
\frac{P(Y = 1 \mid \text{School} = B)}{P(Y = 1 \mid \text{School} = A)} \leq \tau = 0.8. \tag{14}
\]
Blue College accepts 25% of the applicants from both schools, which implies that $\frac{P(Y = 1 \mid \text{School} = B)}{P(Y = 1 \mid \text{School} = A)} = 1$, and so the model does not admit disparate impact and is considered fair with respect to this fairness definition. More generally, if $P(Y = 1 \mid \text{School} = B) \leq P(Y = 1 \mid \text{School} = A)$, then $\tau = 1$ implies this is the fairest possible outcome. Clearly, this model is not fair, as it puts the bottom 75% of students from School B at a disadvantage, while also offering an advantage to the top 25% of students at School B.

We simulate several years of admissions according to this fictitious model with $A$ and $B$ as the students in the two schools, and $\mathcal{F}_0$ and $\mathcal{F}_1$ as the two described admissions patterns. In addition to the 25% rule described above, we also evaluate a 10% and 50% rule. A rule with a lower percentage indicates that fewer students from both schools are accepted by Blue College, and the 0% rule indicates that no student from A or B is accepted. In Figure 2, we plot the Wasserstein distance and disparate impact score as aggregated over the years.

Figure 2: Under disparate impact, all rules converge to a fair policy despite structural inequity in the admissions policy. In contrast, the Wasserstein distance recovers the inequities at all rule levels.

According to disparate impact, for any given year Blue College may appear fair or unfair, but the aggregate admissions data appears fair over time, despite obvious bias in the admissions process. Conversely, the Wasserstein distance reveals a fixed amount of bias over time and shows that the observed bias decreases as the rule tends towards zero. As the decision rule approaches the trivially fair scenario where all individuals from both schools are rejected, the Wasserstein distance decreases, indicating that the outcomes generated by the policies are moving closer together.
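A sketch of how a single year of this simulation might look under the 25% rule; the sample size and GPA distribution are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n = 400
gpa_B = rng.uniform(0.0, 4.0, n)  # placeholder GPA distribution for School B

rule = 0.25                                 # fraction accepted from each school
p_A = np.full(n, rule)                      # School A: uniform 25% chance
cutoff = np.quantile(gpa_B, 1 - rule)
p_B = (gpa_B >= cutoff).astype(float)       # School B: top 25% accepted for sure

# Disparate impact ratio (Definition 4.1): approximately 1, so "fair"
di = p_B.mean() / p_A.mean()

# Wasserstein distance between the two outcome distributions exposes the inequity
m = np.full(n, 1 / n)
C = ot.dist(p_A.reshape(-1, 1), p_B.reshape(-1, 1), metric='euclidean')
w = np.sum(ot.emd(m, m, C) * C)
print(f"disparate impact ratio: {di:.2f}, Wasserstein distance: {w:.3f}")
```

Both schools have the same acceptance rate, so the ratio is 1 and disparate impact sees no problem, while the Wasserstein distance between the all-or-nothing outcomes of School B and the uniform-chance outcomes of School A stays well above zero.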
4.2 Recovering Known Biases in COMPAS

We use the COMPAS Larson et al. [2016a] dataset to show that the optimal coupling can be used to detect bias in subgroups. To audit the COMPAS tool for racial bias, ProPublica Larson et al. [2016b] collected COMPAS scores, prior arrest records, and two-year re-arrest records in Broward County, Florida. In the accompanying report, they found that the COMPAS tool assigned Black defendants risk scores that were disproportionately higher than those assigned to their white counterparts, even after controlling for criminal history. As a result, Black defendants who did not recidivate were often misclassified as "high-risk," whereas for white defendants, the opposite was true: those who did get rearrested were often mistakenly assessed as "low-risk." To corroborate these results, we apply our optimal transport framework to the COMPAS dataset.

We take $A = B$ to be the set of individuals in the COMPAS dataset and compute the Wasserstein distance between the observed (two years after initial arrest) and predicted (by COMPAS) outcomes. For both the observed and predicted cases, we compute a logistic regression model, which takes as input relevant criminal history information and actual and predicted recidivism labels as targets. We use these logistic models to estimate the likelihood an individual recidivates under the two scenarios, denoted $\mu_{\mathcal{F}_0}^A$ and $\mu_{\mathcal{F}_1}^A$, respectively, where $\mathcal{F}_0$ denotes the true and $\mathcal{F}_1$ the predicted recidivism policies. By computing $W(\mu_{\mathcal{F}_0}^A, \mu_{\mathcal{F}_1}^A)$ we obtain the optimal coupling, which we then use to observe the differences between policies.

Figure 3: The transport map reveals how subgroups are mapped to one another from the ground truth outcomes to the predicted outcomes for A. COMPAS's predictions and B. an equal opportunity classifier's predictions. For example, a large fraction of high-risk white individuals achieve classification outcomes from COMPAS similar to low-risk Black defendants when compared to ground-truth recidivism. These differences are less stark in the outcomes produced by the equal opportunity classifier. C. The transport map recovers previously found biases in age and gender in the German credit dataset.

Our framework recovers the bias documented in the ProPublica study. First, we observe that as a result of the discriminatory policy, the group bias score for Black defendants who are considered high-risk is 5.88% higher than the Black defendants who were considered low-risk. For white defendants, however, we observe a 4.92% increase in bias in the opposite direction; the bias experienced by white individuals who are considered low-risk is higher than that of white individuals considered high-risk. This finding suggests that Black individuals who are considered high-risk and white individuals who are considered low-risk are more likely to attain similar predicted outcomes to an individual who is dissimilar to them in feature space, which is consistent with ProPublica's observations. Black individuals who are considered high-risk often have similar criminal histories to white individuals who were considered low-risk and vice versa. In Figure 3A we show the group-wise decomposition of bias: 43.8% of the outcomes attained by low-risk white individuals are similar to outcomes predicted for high-risk Black individuals, and 48.3% of high-risk Black individuals are mapped to low-risk white individuals.

Next, we train a third logistic regression model on the COMPAS labels. This regression differs from the earlier two by employing an equal opportunity constraint based on Zafar et al. [2017] to make the predicted outcomes more fair. The fair classifier increases the true positive rates at test time for both the Black and white groups by 10%. Furthermore, in Figure 3B, we can see that more of the subgroups are mapped to their appropriate counterparts. For the positive-predicted group, which we originally documented as discriminatory against mostly Black defendants, we see a decrease in unfairness from the original to the equal opportunity outcome distribution. Moreover, when we decompose the source of remaining bias, a larger share of individuals are mapped to defendants in their same predicted risk class. This observed decrease in unfairness suggests that the fair classifier produces outcome distributions that are less dependent on race than the original COMPAS predictions.
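Decompositions like those in Figure 3 can be produced by aggregating the coupling's mass over subgroup labels (for example, race crossed with predicted risk); a sketch, assuming the label arrays align with the rows and columns of the coupling:

```python
import numpy as np

def subgroup_mass(pi, row_labels, col_labels):
    """Share of each row subgroup's coupling mass mapped onto each column
    subgroup, in the style of the Figure 3 decompositions."""
    row_labels, col_labels = np.asarray(row_labels), np.asarray(col_labels)
    groups_r, groups_c = np.unique(row_labels), np.unique(col_labels)
    M = np.zeros((len(groups_r), len(groups_c)))
    for gi, g in enumerate(groups_r):
        mask_r = row_labels == g
        total = pi[mask_r].sum()  # all mass leaving subgroup g
        for hi, h in enumerate(groups_c):
            M[gi, hi] = pi[np.ix_(mask_r, col_labels == h)].sum() / total
    return groups_r, groups_c, M
```

Each row of `M` sums to 1 and reads as "where does this subgroup's outcome mass go," which is how statements like "48.3% of high-risk Black individuals are mapped to low-risk white individuals" can be computed.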
4.3 Exploring Recourse in the German Credit Dataset

To explore recourse opportunities with optimal transport, we use the German credit dataset Hofmann [2000], which contains loan applications at a bank and their classification as good or bad credit risks. It is commonly used for algorithmic recourse investigations Gupta et al. [2019], Karimi et al. [2020b], Chen et al. [2020], Rawal and Lakkaraju [2020]. We train a logistic regression classifier $\mathcal{F}$ on the features without the sensitive attributes (age and gender) to produce outcome probabilities for each individual and then compute the transport map between those who were classified as bad creditors, $A$, and those who were classified as good, $B$, under policy $\mathcal{F}$. Figure 3C shows that this map recovers age and gender discrimination that has previously been found, namely that the policy favors older and male applicants. In particular, 69.85% of men aged [18, 76) labeled as bad credit are disproportionately mapped to men aged [25, 76) labeled as good credit.

Algorithmic recourse asks what actionable changes an individual can take to change their classification Karimi et al. [2020a]. To understand recourse options for individuals, we break up the features into three types: actionable (and mutable), non-actionable (but mutable), and immutable (and non-actionable) (see Table A1 in the appendix for the full breakdown). For each type, we can compute the individual bias (Equation 12) while restricting the distance computation to features of that type. In Figure 5, we plot the distribution of the individual biases for each type of feature when split by age and sex for those in the bad credit class.

Figure 5: Individual bias in those labeled as bad creditors broken down by age and sex in the German credit dataset Hofmann [2000] when distances are taken with respect to A. actionable, B. non-actionable, and C. immutable features.

We see that men aged [25, 76) with a bad credit classification have the highest actionable bias, suggesting that this group's actionable features are the furthest from those with a good credit classification. Similarly, women aged [18, 25) with a bad credit classification have the lowest actionable bias even though they receive far fewer good credit ratings. Since the actionable features mainly measure attributes that a fair classifier should consider the most, credit history for example, this difference reinforces the claim that the policy discriminates along age and gender. For both non-actionable and immutable features, the differences in individual bias are stronger across sex than across age. This result suggests that this classification algorithm disadvantages women for attributes that they cannot change.

For each individual $a$ classified as having bad credit, we can compute $\Gamma(a, \pi)$, the group of people with good credit that they were mapped to under the optimal coupling $\pi$. The product $n_0 \sum_{b \in \Gamma(a, \pi)} \pi(a, b)\, b$ projects individual $a$ onto their counterparts in group $B$. Using this product, we can interpolate an individual's actionable features with those they were mapped to with weight $\alpha$ through
\[
(1 - \alpha)\, a_{\text{actionable}} + \alpha\, n_0 \sum_{b \in \Gamma(a, \pi)} \pi(a, b)\, b_{\text{actionable}}. \tag{15}
\]
We can then reapply policy $\mathcal{F}$ to an individual from the bad credit group with their original immutable and non-actionable features and interpolated actionable features to see how their classification changes in response to these actions.

Figure 4: As $\alpha$ increases, increasing the weight of the actionable features of those with good credit in the interpolation (Equation 15), the distribution of the probability of receiving a good credit label for those who had previously been labeled as bad credit shifts from being skewed towards low probabilities to a more uniform distribution.

Figure 4 shows how the probability of being classified as having good credit increases with $\alpha$, implying that making actionable changes to these features can affect outcomes. As $\alpha$ increases, the distributions shift from centering around low probabilities of receiving a good credit label to more uniform probabilities of receiving a good credit label, implying that changing these actionable features will improve an applicant's odds of success.

We can also examine changes in the actionable features of those who were reclassified to understand which features were the most important (see Figure A1 in the appendix). Previous studies tend to focus on recourse accuracy (how many people were successfully reclassified as a result of actions taken) rather than which features were deemed most important, so we cannot compare our results directly. Our results suggest that having a critical account or credit at other banks and not having a guarantor or co-applicant most positively affect classification, and having no credits or credit repaid duly most negatively affect classification. Having a savings account has a large impact on classification; no savings account or more than 100 DM in a savings account positively impacts classification, and having less than 100 DM negatively affects classification.
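A sketch of the interpolation in Equation 15; the classifier `clf` standing in for policy $\mathcal{F}$ is hypothetical, and the boolean mask of actionable columns follows the Table A1 breakdown:

```python
import numpy as np

def recourse_interpolation(a, B, pi_row, alpha, actionable):
    """Eq. 15: move an individual's actionable features a fraction alpha of the
    way toward the coupling-weighted average of their good-credit counterparts.

    a:          (d,) features of one bad-credit individual
    B:          (n1, d) features of the good-credit group
    pi_row:     (n1,) the individual's row of the optimal coupling
    actionable: (d,) boolean mask over actionable features (Table A1)"""
    weights = pi_row / pi_row.sum()  # normalized row, i.e. n0 * pi_row
    target = weights @ B             # projection of a onto group B
    a_new = np.array(a, dtype=float)
    a_new[actionable] = (1 - alpha) * a_new[actionable] + alpha * target[actionable]
    return a_new

# Re-scoring, e.g. clf.predict_proba(a_new[None, :])[0, 1] with a hypothetical
# trained classifier clf, shows how the good-credit probability shifts with alpha
# (Figure 4); immutable and non-actionable features are left untouched.
```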
5 Related Work

Our fairness philosophy builds on work by Kearns et al. [2018] and Friedler et al. [2016]. Kearns et al. [2018] seek to reduce the divide between group and individual fairness by considering statistical notions defined over a large number of subgroups. We also embrace a view of fairness beyond protected group attributes; however, our work differs in that it provides an interpretable framework to study bias as well as allowing notions of fairness to be flexibly instantiated. We also draw inspiration from Friedler et al. [2016], who argue that algorithmic fairness must consider the relationship between outcome and feature spaces. This philosophy inspired our analysis of fairness in the outcome space rather than in the feature space.

Individual fairness Dwork et al. [2012] is often criticized due to its reliance on the existence of similarity metrics on features and outcomes Chouldechova and Roth [2018]. While our definition of individual bias also relies on a distance between features, we are able to examine bias directly through the coupling matrix or vary the distance as a hyperparameter to fully explore possible relations in feature space. Thus, in our case, these distance parameters provide meaningful flexibility and insight.

Several other works Silvia et al. [2020], Zehlike et al. [2020], Gordaliza et al. [2019], Jiang et al. [2019], Feldman et al. [2015], Johndrow et al. [2019], Chiappa and Pacchiano [2021] use optimal transport in the context of algorithmic fairness to compute a bias-neutral feature representation for downstream classification tasks using Wasserstein barycenter representations. Our work is not concerned with computing fair classifiers, but rather with studying and auditing for sources of bias, which most closely relates to Black et al. [2020]. Through FlipTest, Black et al. [2020] use optimal transport to match individuals in different sensitive attribute groups and then test how changing an individual's group membership affects their classification. FlipTest uses the Monge formulation of optimal transport, which finds a permutation between the groups, rather than our more flexible Kantorovich formulation, which can split an individual's mass in the transport map. Additionally, their transport map is based on distributions on features, while ours is based on distributions on outcomes. The authors also use the set of individuals whose classification flipped to study which features were most important in the model. This approach is somewhat similar to our recourse experiments, but it audits the effects of all of the features, not only the actionable ones. Similarly, both Taskesen et al. [2020] and Xue et al. [2020] use optimal transport to measure changes in classification under perturbation to input features.
6 Conclusion

We have presented an optimal transport-based framework for auditing automated decision making that allows us to explore bias in individuals, subgroups, and groups. The flexibility in our framework allows us to test and compare multiple notions of fairness simultaneously; we can use the Wasserstein distance to quantify bias or leverage the coupling matrix to study the structure of bias. Our notion of comparative fairness means that we can decide what frame of reference to use to examine bias. Through the COMPAS example, we have demonstrated that optimal transport can investigate the impact of two different policies (predicted recidivism and ground truth recidivism) on the same population as well as the subgroups within that population. Furthermore, the transport map elucidates structures of bias and allows us to quantify it. In studying the German credit dataset, we have shown that optimal transport can audit automated decisions after they are made as well as offer potential recourse opportunities.

One of the limitations of the standard optimal transport framework is that all of the mass from one distribution must be transported onto all of the mass of the other distribution. In the context of fairness, this limitation means that outliers and unevenly represented subgroups may not be mapped to their appropriate counterparts. To address this, we could modify our framework to use unbalanced or partial optimal transport. Additionally, computing transport maps between large datasets can be computationally intensive, so we could use entropically regularized optimal transport to audit larger datasets. Future work will focus on addressing these limitations as well as formalizing the mathematical theory and its connections to previous definitions of fairness.
References

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

E. Black, S. Yeom, and M. Fredrikson. FlipTest: fairness testing via optimal transport. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 111–121, 2020.

N. Bonneel, M. Van De Panne, S. Paris, and W. Heidrich. Displacement interpolation using Lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, pages 1–12, 2011.

T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13–18, 2009. doi: 10.1109/ICDMW.2009.83.

Y. Chen, J. Wang, and Y. Liu. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020.

S. Chiappa and A. Pacchiano. Fairness with continuous optimal transport, 2021.

A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, 2016.

A. Chouldechova and A. Roth. The frontiers of fairness in machine learning. CoRR, abs/1810.08810, 2018. URL http://arxiv.org/abs/1810.08810.

S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning, 2018.

K. W. Crenshaw. On intersectionality: Essential writings. The New Press, 2017.

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact, 2015.

S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian. On the (im)possibility of fairness, 2016.

P. Gordaliza, E. Del Barrio, G. Fabrice, and J.-M. Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365, 2019.

V. Gupta, P. Nokhiz, C. D. Roy, and S. Venkatasubramanian. Equalizing recourse across groups. arXiv preprint arXiv:1909.03166, 2019.

A. Hanna, E. Denton, A. Smart, and J. Smith-Loud. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 501–512, 2020.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. CoRR, abs/1610.02413, 2016. URL http://arxiv.org/abs/1610.02413.

A. L. Hoffmann. Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society, 22(7):900–915, 2019. doi: 10.1080/1369118X.2019.1573912. URL https://doi.org/10.1080/1369118X.2019.1573912.

H. Hofmann. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data), 2000.

L. Hu and Y. Chen. Fair classification and social welfare. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 535–545, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi: 10.1145/3351095.3372857. URL https://doi.org/10.1145/3351095.3372857.

R. Jiang, A. Pacchiano, T. Stepleton, H. Jiang, and S. Chiappa. Wasserstein fair classification, 2019.

J. E. Johndrow, K. Lum, et al. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics, 13(1):189–220, 2019.

A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050, 2020a.

A.-H. Karimi, B. Schölkopf, and I. Valera. Algorithmic recourse: from counterfactual explanations to interventions. arXiv preprint arXiv:2002.06278, 2020b.

M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564–2572, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kearns18a.html.

J. Larson, S. Mattu, L. Kirchner, and J. Angwin. https://github.com/propublica/compas-analysis, 2016a.

J. Larson, S. Mattu, L. Kirchner, and J. Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica (5 2016), 9(1), 2016b.

M. Lohaus, M. Perrot, and U. V. Luxburg. Too relaxed to be fair. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6360–6369. PMLR, 13–18 Jul 2020.

K. Lum and W. Isaac. To predict and serve? Significance, 13(5):14–19, 2016. doi: https://doi.org/10.1111/j.1740-9713.2016.00960.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1740-9713.2016.00960.x.

L. Munkhdalai, T. Munkhdalai, O.-E. Namsrai, J. Y. Lee, and K. H. Ryu. An empirical comparison of machine-learning methods on bank client credit assessments. Sustainability, 11(3), 2019. ISSN 2071-1050. doi: 10.3390/su11030699.

G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

K. Rawal and H. Lakkaraju. Interpretable and interactive summaries of actionable recourses. arXiv preprint arXiv:2009.07165, 2020.

A. D. Selbst, D. Boyd, S. A. Friedler, S. Venkatasubramanian, and J. Vertesi. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 59–68, 2019.

C. Silvia, J. Ray, S. Tom, P. Aldo, J. Heinrich, and A. John. A general approach to fairness with optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3633–3640, 2020.

B. Taskesen, J. Blanchet, D. Kuhn, and V. A. Nguyen. A statistical test for probabilistic fairness, 2020.

B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017.

Y. Wu, L. Zhang, and X. Wu. On convexity and bounds of fairness-aware classification. In The World Wide Web Conference, WWW '19, pages 3356–3362, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450366748. doi: 10.1145/3308558.3313723. URL https://doi.org/10.1145/3308558.3313723.

S. Xue, M. Yurochkin, and Y. Sun. Auditing ML models for individual bias and unfairness, 2020.

M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification, 2017.

M. Zehlike, P. Hacker, and E. Wiedemann. Matching code and law: achieving algorithmic fairness with optimal transport. Data Mining and Knowledge Discovery, 34(1):163–200, 2020.

Appendix
Computing Individual and Group Bias
Our definitions of individual and group bias assume that we can measure the entire population $X$. In practice, we have data $A = \{a_i\}_{i=1}^{n_0} \sim \mathcal{A}$ and $B = \{b_j\}_{j=1}^{n_1} \sim \mathcal{B}$, which requires slight changes for computations.

Definition 0.1 (Individual Bias). The bias an individual $a$ from group $A$ experiences in comparison to group $B$ across policies is measured by the expectation:
\[
u_B(a) = \sum_{b \in \Gamma(a, \pi)} d(a, b)\, \pi(a, b). \tag{16}
\]
We need to alter the definition of group bias to account for the fact that we can only compare an individual $a \in G \subset A$ to their counterparts in $B$.

Definition 0.2 (Group Bias). The bias (comparatively) experienced across policies for a group $G$ within population $A$ when compared to population $B$ is measured by:
\[
U_B(A \cap G) = \sum_{g \in A \cap G} u_B(g). \tag{17}
\]
Finally, if we decompose the datasets $A$ and $B$ into disjoint groups $A = \{A_1, A_2, \ldots, A_n\}$ and $B = \{B_1, B_2, \ldots, B_m\}$, then the group bias measurement in population $A$ relative to population $B$ can be additively decomposed as
\[
U_B(A_i) = \sum_{j \leq m} U_{B_j}(A_i),
\]
where $U_{B_j}(A_i)$ is the amount of bias measured in group $A_i$ within population $A$ that is the result of comparison with individuals in group $B_j$ within population $B$. We make use of this framework to compare how individuals in the COMPAS dataset are mapped across race and criminal history-based groups in Section 4.2 and how individuals in the German credit data are mapped across age and sex-based groups in Section 4.3.

In the case that an individual $a$ in group $A$ is very different from everyone in group $B$, the individual bias in Definition 3.2 will be high regardless of who $a$ is mapped to. Thus, to detect such outliers, we can normalize the individual bias by the individual's own worst-case scenario, i.e., when they have all of their mass mapped to the individual who is farthest from them:
\[
u^*_B(a) = \frac{n_0 \sum_{j=1}^{n_1} \pi(a, b_j)\, d(a, b_j)}{\max_{j=1:n_1} d(a, b_j)}. \tag{18}
\]
This new normalized individual bias can then be used to differentiate between cases when an individual experiences extreme bias or merely has no equivalent comparison in feature space.
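A sketch of the normalized measure in Equation 18, assuming the same coupling matrix and feature-distance matrix as in the earlier bias computations:

```python
import numpy as np

def normalized_individual_bias(i, pi, D):
    """Eq. 18: individual bias scaled by the worst case in which all of the
    individual's mass is mapped to their farthest counterpart.

    pi: optimal coupling, shape (n0, n1); D[i, j] = d(a_i, b_j)."""
    n0 = pi.shape[0]
    return (n0 * pi[i] @ D[i]) / D[i].max()
```

Values near 1 indicate that the individual is mapped about as badly as possible, which distinguishes genuine outliers from individuals who simply experience extreme bias.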
Variable                      Group
Savings                       Actionable
Credit history                Actionable
Other installments            Actionable
Co-applicant or guarantor     Actionable
Number liable individuals     Actionable
Unemployed                    Actionable
Property owned                Actionable
Number existing credits       Actionable
Installment rate              Actionable
Credit amount                 Non-actionable
Credit duration               Non-actionable
Credit purpose                Non-actionable
Telephone                     Non-actionable
Sex                           Immutable
Age                           Immutable
Foreign national              Immutable
Years at job                  Immutable
Skilled employee              Immutable
Permanent resident since      Immutable

Table A1: Actionable, non-actionable, and immutable features in the German credit dataset.

COMPAS dataset

COMPAS is a commercial tool used to predict recidivism risk of individuals awaiting trial. The COMPAS dataset contains data compiled by ProPublica Larson et al. [2016b] on 6172 people who were scored by the COMPAS algorithm in Broward County, Florida from 2013-2014. Each person has a predicted recidivism score as well as a binary flag to indicate whether or not they recidivated. There are 3,175 Black defendants, 2,103 white defendants, and 2,809 defendants who recidivated within two years in this sample. It contains 9 features: two year recidivism, number of priors, age above 45, age below 25, female, misdemeanor, ethnicity, and predicted recidivism probability. We one-hot encode the categorical variables and normalize the features prior to fitting our models.
German credit dataset
The German credit dataset Hofmann [2000] contains 1000 instances, with 26 features each, of loan applications at a bank. It includes applicant profiles as well as their classification as a bad ($n = 300$) or good ($n = 700$) credit risk. We one-hot encode categorical variables and then z-score normalize all features before fitting our logistic regression model. Table A1 describes how we classified features as actionable, non-actionable, and immutable, as inspired by Chen et al. [2020].

For each individual labeled as having bad credit, we interpolate their actionable features with the actionable features of those they were mapped to (Equation 15). Then, we reclassify these individuals. In Figure A1, we present the percentage of individuals who were reclassified as having good credit and record the average change for each actionable feature for varying values of $\alpha$.

Figure A1: Average changes in actionable features in the German credit dataset after individuals with bad credit were interpolated (at level $\alpha$).