Statistical Equity: A Fairness Classification Objective
Ninareh Mehrabi
University of Southern California, Information Sciences Institute
[email protected]
Yuzhong Huang
University of Southern California, Information Sciences Institute
[email protected]
Fred Morstatter
University of Southern California, Information Sciences Institute
[email protected]
Abstract
Machine learning systems have been shown to propagate the societal errors of the past. In light of this, a wealth of research focuses on designing solutions that are "fair." Even with this abundance of work, there is no singular definition of fairness, mainly because fairness is subjective and context dependent. We propose a new fairness definition, motivated by the principle of equity, that considers existing biases in the data and attempts to make equitable decisions that account for these previous historical biases. We formalize our definition of fairness and motivate it with its appropriate contexts. Next, we operationalize it for equitable classification. We perform multiple automatic and human evaluations to show the effectiveness of our definition and demonstrate its utility for aspects of fairness, such as the feedback loop.
With the omnipresent use of machine learning in different decision and policy making environments, fairness has gained significant importance. This became the case when researchers noticed that an AI system used to measure recidivism risk in bail decisions was biased against certain racial groups [Angwin et al., 2016]. As a reaction to the disclosure of this issue and various others, the AI community has made efforts to mitigate biased and unfair outcomes in decision making processes. Many researchers have proposed definitions of algorithmic fairness, while others have tried to use these definitions in different down-stream tasks in an effort to overcome unfair outcomes. Despite the abundance of fairness definitions, the majority of them are not complete [Gajane and Pechenizkiy, 2018]. Moreover, theoretical analyses of these definitions have found that many at the forefront are incompatible with each other [Kleinberg, Mullainathan, and Raghavan, 2016]. For now at least, fairness remains a philosophical question that is not yet answered in the computational domain.

[Figure 1: Notion of equality in fairness is depicted and formalized along with our newly formalized notion of equity. Equality: $p(\hat{Y} \mid A=\text{blue}) = p(\hat{Y} \mid A=\text{purple}) = p(\hat{Y} \mid A=\text{yellow})$. Equity: $p(\hat{Y} \mid A=\text{blue}) + p(Y \mid A=\text{blue}) = p(\hat{Y} \mid A=\text{purple}) + p(Y \mid A=\text{purple}) = p(\hat{Y} \mid A=\text{yellow}) + p(Y \mid A=\text{yellow})$.]

In light of that, we propose and mathematically formalize the equity notion of fairness, in which resources and outcomes are distributed to overcome obstacles experienced by groups in order to maximize their opportunities [Schement, 2001]. In this work we take the perspective that historical biases should be compensated and disadvantaged groups should be leveraged. We then introduce a data-driven classification objective function that operationalizes the notion of equity, in which existing historical biases in the training data are compensated through predictions on the test data. This approach not only targets fixing biases but also targets minimizing the feedback loop phenomenon, in which biased data contaminates the decision making outcome and continues to stay and grow through the system.

Our definition of fairness is an augmented version of statistical parity [Dwork et al., 2012] that we adapt to measure equity. Unlike previous definitions and objective functions in which only equality is considered in present outcomes, our approach considers the historical biases present in the data and combines the notions of equity and affirmative action while also satisfying equality amongst groups. Our definition is a departure from differential privacy [Jagielski et al., 2019] and fairness through unawareness [Chen et al., 2019], since having access to sensitive attributes is necessary in some scenarios. For instance, in cases that pertain to the medical domain, access to sensitive attributes such as gender and age is required to make a decision.

Two different fairness realizations are depicted in Figure 1. On the left side there is the notion of equality, in which every group is given an equal amount of resources, which is too much for some members and insufficient for others. This is the problem that motivates this work: how can a classifier produce predictions that are good for the majority of a group or society?
This leads us to the right picture, which depicts equity, where leverage is given through the model to give the groups appropriate resources to reach their goals.

Note: We use parity (statistical parity) and equality interchangeably as synonyms throughout this paper.

The contributions of this paper are as follows:

1. We define and formalize equity as a fairness definition.
2. We demonstrate how this definition can be made actionable by proposing a loss function that combines equity with the cross-entropy loss in the classifier. Through experimentation, we demonstrate how our definition compares to the objective of equality.
3. We then discuss and experimentally show the effectiveness of our definition in mitigating the feedback loop problem [Chouldechova and Roth, 2018].
4. Finally, we evaluate our fairness notion against the parity notion through human annotators and show how these definitions fare in different real life scenarios.
Equity is the distribution of resources among groups to overcome obstacles and to raise their opportunities for access [Schement, 2001]. Thus, historically disadvantaged groups are compensated, and others get their fair share to reach their goals. To operationalize equity, we learn the existing biases from the historical data and compensate for them in the predictions generated by the model. This can be viewed similarly to affirmative action, in which present decisions (algorithmic or otherwise) are made to compensate for biases of the past, but in such a way that groups reach an ultimate equality. The goal is to equalize all the groups in the long run and give each group their fair share. We modeled this phenomenon in our classification objective function, which is described in detail in this section.

Our objective function consists of two terms:

• The Fairness Objective: the goal is to enforce equity amongst groups.

• The Classification Objective: we enforce the classification objective to achieve predictive accuracy.

Finally, we combine these two objectives and control the importance of each using a regularization parameter.
Herein, we formalize our equity notion of fairness. Let $Y$ be a random variable denoting an outcome of a decision making process and $A$ be the sensitive variable (e.g., demographic group membership). Let $D$ be the set of all decisions made in the past (or historical decisions). Let the joint distribution $p_D(Y, A)$ summarize the essential statistics of past decisions. We are interested in the case when past decisions contain bias, i.e., $p_D(Y \mid A = a) \not\approx p_D(Y \mid A = b)$.

Let $M$ be the set of instances for which a prediction has to be made using a machine learning method. As mentioned earlier, we want the decisions of the classifier to account for and reverse the biases in the data in an equitable manner. We formalize this in the following way. Let a joint distribution $p_M(Y, A)$ summarize the essential statistics of the classifier on $M$. Our goal is to generate equitable decisions for each group:

$$p_D(Y = y \mid A = a)\,p_D(A = a) + p_M(Y = y \mid A = a)\,p_M(A = a) = p_D(Y = y \mid A = b)\,p_D(A = b) + p_M(Y = y \mid A = b)\,p_M(A = b),$$

for each possible outcome $y$. We assume that we study the same number of instances from each demographic group between both the historical and outcome sets (i.e., $p_D(A = a) = p_D(A = b) = 1/2$ and $p_M(A = a) = p_M(A = b) = 1/2$). Under this assumption, our fairness criterion becomes:

$$p_D(Y = y \mid A = a) + p_M(Y = y \mid A = a) = p_D(Y = y \mid A = b) + p_M(Y = y \mid A = b).$$

One can interpret this criterion as equalizing the odds of different groups for a random person drawn with equal probability from $D$ or $M$. To comply with the notation widely used in the fairness literature, $p_D(Y \mid A = a)$ can also be written as $p(Y \mid A = a)$ and $p_M(Y \mid A = a)$ as $p(\hat{Y} \mid A = a)$. Finally, to translate our new fairness notion into the widely used format, one can write our objective as follows:

$$p(\hat{Y} \mid A = a) + p(Y \mid A = a) = p(\hat{Y} \mid A = b) + p(Y \mid A = b).$$

Definition 1 (Statistical Equity). A predictor is statistically equitable among demographic groups $a$ and $b$ if it satisfies $p(\hat{Y} \mid A = a) + p(Y \mid A = a) = p(\hat{Y} \mid A = b) + p(Y \mid A = b)$.
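To make the criterion concrete, the following is a minimal sketch (in Python, assuming binary outcomes and a binary sensitive attribute; the array and function names are illustrative, not from the released code) that estimates both sides of the statistical equity condition from historical labels on D and model predictions on M:

import numpy as np

def equity_gap(y_hist, a_hist, y_pred, a_pred, group_a=0, group_b=1, outcome=1):
    """Estimate |[p(Y_hat=y|A=a) + p(Y=y|A=a)] - [p(Y_hat=y|A=b) + p(Y=y|A=b)]|."""
    # Historical outcome rates per group, p(Y = outcome | A = g), estimated on D.
    hist_a = np.mean(y_hist[a_hist == group_a] == outcome)
    hist_b = np.mean(y_hist[a_hist == group_b] == outcome)
    # Predicted outcome rates per group, p(Y_hat = outcome | A = g), estimated on M.
    pred_a = np.mean(y_pred[a_pred == group_a] == outcome)
    pred_b = np.mean(y_pred[a_pred == group_b] == outcome)
    # Statistical equity asks the two sums to be equal; return how far apart they are.
    return abs((hist_a + pred_a) - (hist_b + pred_b))

A gap of zero means the predictor satisfies Definition 1 exactly for that outcome; larger values quantify how far the combined historical-plus-predicted rates of the two groups are from each other.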
In order to satisfy the fairness objective and to be able to couple it with our classification loss, we divided our total classification objective into two terms. The first term, denoted by $F_{\text{equity}}$, attempts to satisfy the fairness objective, and the other term, denoted by $L$, attempts to satisfy the classification loss. These two losses are then controlled and coupled by a regularization term $\beta$. The resulting fairness objective for each possible outcome $y$ is coupled in our classifier as follows:

$$F_{\text{equity}}(\theta) = \sum_{y} \left| \left[ \frac{1}{n} \sum_{i=1}^{n} p(\hat{Y}_i = y \mid A = a) + p(Y = y \mid A = a) \right] - \left[ \frac{1}{n} \sum_{i=1}^{n} p(\hat{Y}_i = y \mid A = b) + p(Y = y \mid A = b) \right] \right|.$$

Note that there is a difference between the notation of historical and predictive (future) outcomes. By equalizing the sum of historical plus future outcomes of one group to another, we enforce affirmative action and try to compensate for observed historical biases in the data by correcting and adjusting the predictive outcomes, so that eventually all the groups reach an equilibrium in our objective function. This equilibrium can be in terms of all the groups satisfying their goals, e.g., as in Figure 1. We defined our classification objective as the cross-entropy loss $L(\theta)$. Notice that $L(\theta)$ can be any other loss; however, cross-entropy is used in our experiments. Finally, by combining the fairness objective with the classification objective, we define the whole objective of our fair classification task as follows:

$$\min_{\theta} \; \beta F_{\text{equity}}(\theta) + (1 - \beta) L(\theta). \quad (1)$$

$\beta$ is a hyperparameter that controls the importance of the fairness constraint over the classification objective. Making the $\beta$ value larger enforces more of the fairness (equity) constraint in our objective, while a smaller $\beta$ value favors classification accuracy over the fairness constraint.

The model used in these experiments is a two-layer dense network with 256 hidden dimensions. We stopped training the model after seeing no improvement on the validation set for 100 iterations (validation epochs) in all the experiments, with a starting learning rate of 0.01 and a learning rate decay of 0.95. The code is available for reproduction of our experiments at https://github.com/Ninarehm/Fairness. We performed the experiments on three different loss functions, described below; a minimal sketch of how these losses can be combined is given after the list.

• Equity Loss: This loss function is our proposed loss function in Equation 1, introduced in the previous section.

• Parity Loss: This loss function mirrors the statistical parity notion of fairness, combining statistical parity with the classification loss as follows:

$$F_{\text{parity}}(\theta) = \sum_{y} \left| \frac{1}{n} \sum_{i=1}^{n} p(\hat{Y}_i = y \mid A = a) - \frac{1}{n} \sum_{i=1}^{n} p(\hat{Y}_i = y \mid A = b) \right|,$$

where $F_{\text{parity}}$ represents the fairness loss corresponding to the parity notion of fairness. This loss is then combined with the classification cross-entropy loss as before, which forms the whole objective loss as follows:

$$\min_{\theta} \; \beta F_{\text{parity}}(\theta) + (1 - \beta) L(\theta). \quad (2)$$

• Classification Loss (Cross-entropy): This is a loss function containing only the cross-entropy loss, with no fairness constraints imposed:

$$\min_{\theta} \; L(\theta). \quad (3)$$
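The following is a minimal PyTorch-style sketch, written by us for illustration, of how the equity and parity penalties might be combined with the cross-entropy loss following Equations 1-3. The function names, the binary setup (groups encoded as 0/1), and the use of soft predicted probabilities in place of p(Ŷ_i = y | A) are our own simplifying assumptions, not a verbatim excerpt of the released code:

import torch
import torch.nn.functional as F

def group_rates(probs, groups, group_id):
    # Average predicted probability over members of one group: an estimate of
    # p(Y_hat = y | A = group_id) for y in {0, 1}.
    p1 = probs[groups == group_id].mean()
    return torch.stack([1.0 - p1, p1])

def equity_penalty(probs, groups, hist_rates_a, hist_rates_b):
    # hist_rates_a[y] = p(Y = y | A = a), estimated once from the historical training data.
    pred_a = group_rates(probs, groups, 0)
    pred_b = group_rates(probs, groups, 1)
    return torch.abs((pred_a + hist_rates_a) - (pred_b + hist_rates_b)).sum()

def parity_penalty(probs, groups):
    pred_a = group_rates(probs, groups, 0)
    pred_b = group_rates(probs, groups, 1)
    return torch.abs(pred_a - pred_b).sum()

def total_loss(logits, labels, groups, hist_rates_a, hist_rates_b, beta=0.5, mode="equity"):
    probs = torch.sigmoid(logits)                       # p(Y_hat = 1) per instance
    ce = F.binary_cross_entropy(probs, labels.float())  # classification term L(theta)
    if mode == "equity":
        fair = equity_penalty(probs, groups, hist_rates_a, hist_rates_b)
    elif mode == "parity":
        fair = parity_penalty(probs, groups)
    else:                                               # plain classifier, Equation 3
        return ce
    return beta * fair + (1.0 - beta) * ce              # Equations 1 and 2

At beta = 0 this reduces to the plain cross-entropy classifier, and at beta = 1 only the fairness penalty is optimized; the experiments below sweep beta between these extremes.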
In our results, Equity corresponds to a classifier using the equity loss function defined in Equation 1, Parity to a classifier using the parity loss function defined in Equation 2, and Classifier to a classifier using the cross-entropy loss only, as in Equation 3.
[Figure 2: Accuracy and fairness gain results for the COMPAS and Adult datasets over different β values. Top plots report the accuracy results, while bottom plots report the fairness gain results. Each point on the plots is the average value of 10 experiments performed on the 10 random splits; the 10 random split sets are the same across different β values. For details of these values along with standard deviation numbers refer to Tables 7 and 8 in the Appendixes section.]

We tested these classifiers on two benchmark fairness datasets, the COMPAS and Adult datasets, and report the performance accuracy and the fairness gain defined below.

Definition 2 (Fairness Gain).
For a given loss function $\ell \in \{\text{Equity}, \text{Parity}, \text{Classifier}\}$, we define the fairness gain relative to a simple classifier with no fairness constraint, for demographic groups $a$ and $b$ on the $D \cup M$ set, as:

$$\text{Fairness Gain} = \big[\,|p(Y \mid A = a) - p(Y \mid A = b)|\,\big]_{\text{classifier}} - \big[\,|p(Y \mid A = a) - p(Y \mid A = b)|\,\big]_{\ell}.$$

In other words, we measure how effective a method was in reducing disparities among demographics compared to a classifier with no fairness constraint. Note how this measure is similar to statistical parity, except that it considers both the history and the future predictive outcome.
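As an illustration, a small sketch (our own, with illustrative array names) that computes this fairness gain from the labels observed on D combined with the predictions made on M:

import numpy as np

def disparity(labels, groups, group_a=0, group_b=1, outcome=1):
    # |p(Y = outcome | A = a) - p(Y = outcome | A = b)| on the combined D u M set.
    rate_a = np.mean(labels[groups == group_a] == outcome)
    rate_b = np.mean(labels[groups == group_b] == outcome)
    return abs(rate_a - rate_b)

def fairness_gain(hist_labels, hist_groups, preds_plain, preds_fair, test_groups):
    # Combine historical decisions (D) with each model's predictions (M).
    labels_plain = np.concatenate([hist_labels, preds_plain])
    labels_fair = np.concatenate([hist_labels, preds_fair])
    groups_all = np.concatenate([hist_groups, test_groups])
    # Gain = disparity under the unconstrained classifier minus disparity under the fair loss.
    return disparity(labels_plain, groups_all) - disparity(labels_fair, groups_all)

A positive gain means the fairness-constrained loss left the combined historical-plus-predicted outcome rates of the two groups closer together than the unconstrained classifier did.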
For robustness purposes, each of our experiments was performed on 10 random train, validation, and test splits for each of the datasets, and the significance of our hypotheses was reported accordingly. The reported results are averages of the 10 experiments performed on the 10 different random splits. Due to the existing variance across splits, we found the Mann-Whitney U test more suitable and reliable for these experiments. Thus, we report the significance of the averaged results in terms of p-values instead of standard deviations. However, the standard deviations and detailed averaged results are all listed in the Appendixes section. A sketch of this protocol is given below.
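Concretely, the evaluation protocol can be sketched as follows (a harness we write for illustration; the split helper and the placeholder fairness-gain values are assumptions, not the paper's actual numbers or code):

import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.model_selection import train_test_split

def random_split(X, y, groups, seed):
    # One 80-10-10 train / validation / test split, as used for each of the 10 runs.
    X_tr, X_rest, y_tr, y_rest, g_tr, g_rest = train_test_split(
        X, y, groups, test_size=0.2, random_state=seed)
    X_val, X_te, y_val, y_te, g_val, g_te = train_test_split(
        X_rest, y_rest, g_rest, test_size=0.5, random_state=seed)
    return (X_tr, y_tr, g_tr), (X_val, y_val, g_val), (X_te, y_te, g_te)

# Placeholder per-split fairness-gain values; in the real protocol each entry comes from
# training one loss on one of the 10 random splits (these numbers are not real results).
rng = np.random.default_rng(0)
equity_gains = rng.normal(loc=5.0, scale=1.0, size=10)
parity_gains = rng.normal(loc=3.0, scale=1.0, size=10)

# One-sided Mann-Whitney U test: does the Equity loss achieve greater fairness gain?
stat, p_value = mannwhitneyu(equity_gains, parity_gains, alternative="greater")
print(f"Mann-Whitney U = {stat:.1f}, p-value = {p_value:.4f}")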
Hypothesis 1. The Equity classification objective will achieve the highest gain in fairness, at a slight cost to accuracy. We expect this fairness gain and accuracy degradation to be more noticeable for higher β values, since β controls the importance of fairness gain over classification accuracy.

The COMPAS dataset contains information about defendants from Broward County. The label in our classification task was whether or not a defendant will re-offend within two years. The sensitive attribute in our experiments was gender. Among the features in this dataset, we used the features listed in Table 2. We split the dataset into 10 different random 80-10-10 splits for train, test, and validation sets.

Features: sex, age_cat, race, juv_fel_count, juv_misd_count, juv_other_count, priors_count, c_charge_degree
Table 2: Features used in the experiments from the COMPAS dataset.

The averaged accuracy and fairness gain results obtained from applying the different losses in our classification task, over 10 experiments on different splits with different β values on the COMPAS dataset, are shown in Figure 2. From the results shown in Figure 2, we observe that the classifier trained on our Equity loss achieves a higher fairness gain for all β values. We also show the significance of these results in terms of a one-vs-all (Equity vs. Parity and Classifier) Mann-Whitney U test in Table 1 for all β values. Although from the results in Figure 2 one can observe a degradation in performance in terms of test accuracy, the results in Table 3 show the insignificance of this degradation for low to mid β values on this dataset. However, the degradation becomes significant for higher β values, as expected due to the control of β over the accuracy-fairness trade-off. Although we report the results for all β values from 0.1 to the extreme of 0.9, we recommend a β value around 0.3-0.5, which balances the fairness gain and test accuracy.

The Adult dataset contains information about individuals, with a label corresponding to whether an individual's income exceeds 50k per year or not. We utilized all the features from the dataset in our classification task for predicting the label. We considered gender as the protected attribute in our classification loss. The data was split into 10 different random 80-10-10 splits for train, test, and validation sets for each set of experiments. The averaged test accuracy and fairness gain results over the 10 different splits for each β value, obtained from applying the different losses in our classification task on the Adult dataset, are shown in Figure 2. As shown in Figure 2, for all β values our definition was able to achieve a higher fairness gain. We also show the significance of these results in Table 1. Although from the results in Figure 2 one can observe a degradation in performance in terms of test accuracy, and the results in Table 3 show the significance of this degradation, the degradation is still a reasonable price for fairness considering the gain in fairness, especially for mid β values, for which it can be perceived as negligible. As with the COMPAS dataset, we recommend a β value around 0.3-0.5, which balances the fairness gain and test accuracy for this dataset as well.
As expected in our initial hypothesis, through experimentation and hypothesis testing we find that using the Equity loss in classification results in a gain in fairness. Through the Mann-Whitney U significance test, we show that this gain is significant for all β values on both datasets. With regards to the degradation in test accuracy, as expected, larger β values resulted in more loss in test accuracy along with more gain in fairness. However, this loss was shown to be non-significant for one of our datasets, the COMPAS dataset, for the low to mid β values that we recommend using. For the Adult dataset, although the loss was shown to be statistically significant, the test accuracy loss was reasonable considering the gain in fairness. Figure 2 demonstrates the behavior of the different losses over different β values in terms of test accuracy and fairness gain for the COMPAS and Adult datasets. Tables 1 and 3 indicate the significance of our hypothesis in terms of the Equity loss achieving the highest gain in fairness, and the significance of its degradation in test accuracy over the other baselines, for the COMPAS and Adult datasets respectively. From the overall results, we suggest the use of β values between 0.3-0.5 when using our Equity objective, as they are the most effective in terms of gaining fairness while maintaining a reasonable test accuracy.

[Table 1: One-vs-all Mann-Whitney U test p-values for the COMPAS and Adult datasets over different β values. The assumed test hypothesis was whether Equity will have greater fairness gain compared to the Parity and Classifier losses.]

[Table 3: One-vs-all Mann-Whitney U test p-values for the COMPAS and Adult datasets over different β values. The test reports the significance of the degradation in performance of the Equity loss over the other two losses in terms of test accuracy.]

An important and major concern in the fairness community is the feedback loop phenomenon [Chouldechova and Roth, 2018]. Since biased data is generated by humans, these biases are perpetuated after the models make biased decisions based on the historical biased data. The bias originates from humans, the models amplify these biases, and they loop biased results back to the humans. This loop gets repeated and continues to carry the initial existing biases. This phenomenon is called the feedback loop phenomenon. We hope that since our notion considers and compensates for the historical biases in the training set, which might have come from humans in initial phases, and attempts to fix them by achieving an ultimate equilibrium considering the past and future decisions, it may help with the mitigation of the feedback loop phenomenon.

In order to observe the effect of our new equity notion on fixing the historical biases in the training sets, and effectively fixing the feedback loop as a consequence, we conducted experiments on the datasets used in the previous section and recorded averaged results over the 10 experiments on random splits, along with their Mann-Whitney U significance tests, under the following hypothesis. The model architecture remains the same as in the experiments conducted in the previous section. In addition, the experiments are performed on the Equity, Parity, and Classification losses for comparison purposes.

[Table 4: Mann-Whitney U test p-values showing the significance of the results reported in Figure 3 for a β value of 0.5, for different iterations, on the COMPAS and Adult datasets.]
Hypothesis 2. The Equity classification objective can be the most effective in terms of reducing the disparity (bias), defined as $|p(Y \mid A = a) - p(Y \mid A = b)|$, between demographic groups $a$ and $b$ over iterations in which the predictive outcomes on the test sets are accumulated over time into the historical train sets.

Herein, we answer the question of what will happen if the equity classifier is allowed to play out in a realistic environment. We simulate the feedback loop as an iterative training-predicting cycle, sketched below. We train our model in sequential chunks, splitting the test data into 10 equal-sized chunks. At the first iteration, we train the model using the train data. At each subsequent iteration, we take one of the chunks from our test data, add it to the previous train data alongside its predicted labels, and retrain the model for the next iteration. We then delete this chunk from the test set and keep it in the train set. Each experiment was repeated 10 times with different random splits.
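A minimal sketch of this simulation (our own illustrative Python; train_model and predict are hypothetical stand-ins for the actual training and inference routines):

import numpy as np

def simulate_feedback_loop(X_tr, y_tr, g_tr, X_te, g_te, train_model, predict, n_chunks=10):
    """Iteratively fold model predictions on test chunks back into the training data."""
    disparities = []
    for chunk_X, chunk_g in zip(np.array_split(X_te, n_chunks), np.array_split(g_te, n_chunks)):
        model = train_model(X_tr, y_tr)      # hypothetical training routine
        chunk_pred = predict(model, chunk_X) # hypothetical inference routine
        # Fold the chunk, with its predicted labels treated as ground truth, into the train set.
        X_tr = np.concatenate([X_tr, chunk_X])
        y_tr = np.concatenate([y_tr, chunk_pred])
        g_tr = np.concatenate([g_tr, chunk_g])
        # Disparity |p(Y=1|A=0) - p(Y=1|A=1)| on the accumulated (historical + predicted) labels.
        disparities.append(abs(np.mean(y_tr[g_tr == 0]) - np.mean(y_tr[g_tr == 1])))
    return disparities

Under these assumptions, the per-iteration disparities recorded here correspond to the quantity that Figure 3 reports for the three losses.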
[Figure 3: Simulation of the feedback loop phenomenon and results obtained in the reduction of bias via the different methods on the COMPAS and Adult datasets. As expected, higher β values result in a larger reduction of bias for the two fairness-based objectives (Equity and Parity). It also shows how Equity is more effective in reducing the bias over iterations. Each point on the plots is the average value of 10 experiments performed on the 10 random splits; the 10 random split sets are the same across different β values.]

Figure 3 reports $|p(Y \mid A = \text{female}) - p(Y \mid A = \text{male})|$, averaged across the 10 runs, as a measure of disparity for both predicted class labels $Y = 0$ and $Y = 1$, for each dataset, each fairness notion, and each β value. Results for other β values can be found in the supplementary material. These results demonstrate that our notion of fairness was able to minimize the gap between $p(Y \mid A = \text{female})$ and $p(Y \mid A = \text{male})$ in all of the datasets. The results show that using our notion can bring equality, equity, and fairness in the long run and mitigate the negative effects of the feedback loop phenomenon. As expected and shown in Figure 3, higher β values resulted in achieving fairer outcomes, and hence a greater reduction of bias. In addition, we report Mann-Whitney U test results to show the significance of our results. Table 4 shows the significance of these results for the COMPAS and Adult datasets for a β value of 0.5 for different iterations, supporting our hypothesis. This is consistent with our earlier finding that β = 0.5 is the most effective and reasonable, with a significant impact on gaining fairness, reducing bias, and balancing the fairness-accuracy trade-off.

In order to understand the public's perception of equity (via our proposed definition) and its comparison to equality in different real life scenarios, we conducted surveys on Amazon Mechanical Turk in the vein of [Saxena et al., 2019].
We recruited 150 workers and showed them four different real life scenarios. In each scenario, we proposed two fairness solutions: one based on equity and one based on equality/parity. For each scenario, we asked workers to rate how fair they think each solution is on a scale of zero to four. At the end of each scenario, we asked workers to select their preferred fairness solution for each scenario. We asked workers to provide written justification for their responses. In addition, we had a "sanity check" question at the end of our survey to discover and remove workers behaving randomly. The screenshot of our questionnaire is included in the Appendixes section for more detailed information.

A summary of the scenarios is as follows. Note that the experimental results follow the same numbering convention as listed below.

• Scenario 1 (Equality vs Equity): We asked workers to rate the pictures of equity and equality in Figure 1 and choose their preferred picture.

• Scenario 2 (School Loan): Workers rate loan distribution mechanisms. One considers each student's past history of receiving a scholarship (equity). The other simply proposes to equally distribute the loan among all the students (parity).

• Scenario 3 (Government Subsidized Housing): We asked respondents to rate the government subsidized housing distribution systems proposed in the survey: one based on equity, considering how houses were historically distributed across different races (equity).
The other proposes to equally distribute houses across different racial categories (parity).

• Scenario 4 (College Admission): We asked respondents to rate college admission systems: one based on equity, considering whether the student is a first generation college student (equity). The other equally admits students from first generation and non-first generation backgrounds (parity).

          Scenario 1   Scenario 2   Scenario 3   Scenario 4
Equity    134          115          59           44
Parity    16           35           91           106

Table 5: Number of people preferring solutions provided by the equity notion vs. solutions provided by the parity notion of fairness in the different scenarios.
After gathering and analyzing responses from the Mechanical Turk workers, we observed that there are some cases in which our notion of fairness is strongly preferred by a large margin, and other cases where preference is given to the parity notion. Fairness is subjective, and different people may have different takes on what would be a fair solution to a particular case. That is the main reason we introduce this notion: not only is our definition preferred in some scenarios, but even in scenarios where it is not preferred overall, it still receives support from certain groups of people.

The statistics of the ratings for each of the 4 scenarios are shown in Figure 4. In addition, Table 5 depicts the number of Mechanical Turk workers who preferred a certain solution following a fairness definition in each of the scenarios. Similar to the findings in [Saxena et al., 2019], we also observed support for the principle of affirmative action in our experiments, which relates to our notion. From the results it is evident that strong preference is given to our notion for scenarios 1 and 2, and although scenarios 3 and 4 are not over-preferred for our notion, a considerable number of people still gave preference to our notion in these scenarios. All the justifications written down by the respondents were analyzed. For each preference recorded in this paper, respondents gave justifications that cover a wide range of perspectives. The dataset of collected responses is available.

[Figure 4: Human ratings of the equity and parity notions of fairness in the different scenarios, on a 0-4 scale.]
With the relatively recent popularity of fairness in the machine learning and natural language processing domains, the need to find a universal and more complete fairness definition and measure is crucial. Although finding such a definition and measure is a challenge not only in machine learning but also in the social and political sciences, steps need to be taken to make current definitions evolve and cover more real world cases. In light of this, many fairness definitions have been proposed: some try to complement others, and some start a new direction and viewpoint of their own. Different bodies of work have tried to incorporate the proposed definitions in different downstream tasks such as classification and regression [Menon and Williamson, 2018, Berk et al., 2017, Krasanakis et al., 2018, Agarwal, Dudik, and Wu, 2019, Goel, Yaghini, and Faltings, 2018].
For a more complete list of existing fairness definitions, there exist papers that survey [Mehrabi et al., 2019] and explain [Verma and Rubin, 2018] the proposed definitions. Here, we elaborate on some important and widely known definitions related to the work introduced in this paper.
Statistical Parity. In statistical parity [Dwork et al., 2012] the goal is to satisfy $P(\hat{Y} \mid A = a) = P(\hat{Y} \mid A = b)$. This notion states that, regardless of the demographic group $A$ an individual belongs to, the predicted outcome should be the same for both demographic groups $A = a$ and $A = b$.

Equalized Odds. In equalized odds [Hardt et al., 2016] the goal is to satisfy $P(\hat{Y} = 1 \mid A = a, Y = y) = P(\hat{Y} = 1 \mid A = b, Y = y)$ for $y \in \{0, 1\}$. This notion states that both groups $A = a$ and $A = b$ should have equal true positive and false positive rates.

Equal Opportunity. In equal opportunity [Hardt et al., 2016] the goal is to satisfy $P(\hat{Y} = 1 \mid A = a, Y = 1) = P(\hat{Y} = 1 \mid A = b, Y = 1)$. This notion states that both demographic groups $A = a$ and $A = b$ should have equal true positive rates.

Counterfactual Fairness. In counterfactual fairness [Kusner et al., 2017] the goal is to satisfy $P(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a) = P(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a)$ under any $X = x$ and $A = a$, for all $y$ and for any value $a'$ feasible for $A$. The perception in the counterfactual fairness definition is that if the decision is the same in both the actual world and a counterfactual world where the individual belonged to a different group, then it is a fair decision.

Fair Classification. Research in the fairness domain does not conclude with defining fairness definitions and measures, but also covers the incorporation of these definitions in tasks such as classification [Calders and Verwer, 2010, Huang and Vishnoi, 2019]. This incorporation is a challenge on its own. Some methods introduce pre-processing techniques that augment the train data for discrimination removal [Kamiran and Calders, 2012]. Other methods perform in-processing, which tries to incorporate the fairness objective during the training phase [Kamishima et al., 2012, Zafar et al., 2015, Wu, Zhang, and Wu, 2018]. Other definitions require post-processing techniques, in which discrimination removal is performed after the training phase, treating the model as a black box [Hardt et al., 2016, Pleiss et al., 2017]. The literature on fair classification also targets a wide variety of fairness definitions, such as equality of opportunity and equalized odds [Hardt et al., 2016, Woodworth et al., 2017], statistical parity [Agarwal et al., 2018], and subgroup fairness [Ustun, Liu, and Parkes, 2019], and their incorporation in the classification task.
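To make these group-level definitions concrete, the following is a small sketch of our own (illustrative names, binary labels and groups assumed) that measures how far a set of predictions is from satisfying each of them; summarizing equalized odds by the larger of the two rate gaps is our choice, not a convention from the cited works:

import numpy as np

def fairness_metrics(y_true, y_pred, groups, a=0, b=1):
    """Gaps for the group fairness definitions above (0 means the definition is met)."""
    in_a, in_b = groups == a, groups == b
    # Statistical parity: difference in positive prediction rates between groups.
    parity_gap = abs(np.mean(y_pred[in_a]) - np.mean(y_pred[in_b]))
    # True positive and false positive rates within one group.
    tpr = lambda m: np.mean(y_pred[m & (y_true == 1)])
    fpr = lambda m: np.mean(y_pred[m & (y_true == 0)])
    # Equal opportunity: difference in true positive rates.
    eo_gap = abs(tpr(in_a) - tpr(in_b))
    # Equalized odds: both true positive and false positive rates should match.
    eodds_gap = max(eo_gap, abs(fpr(in_a) - fpr(in_b)))
    return {"statistical_parity": parity_gap,
            "equal_opportunity": eo_gap,
            "equalized_odds": eodds_gap}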
In this work, we proposed a definition of fairness based upon equity, demonstrated its appeal as a fairness outcome to a wide audience, and formalized it for classification. We tested this approach in a traditional cross validation setup, and demonstrated how it can be used in a real-world environment, such as with the unfairness that can arise from the feedback loop. Our results show the effectiveness of our method in mitigating bias and achieving fairness. We also performed a human evaluation comparing our notion with the equality/parity notion of fairness in different scenarios. As a future direction, our definition can be utilized to achieve and study the effects of equity in classification with different techniques. In this work, we provide a framework for equity to be formalized; however, there is still work to be done in the area of fairness with regards to equity. Future work is to further study how the equity notion interacts with other existing definitions of fairness, such as equality of opportunity, equalized odds, or other definitions in the equality domain beyond statistical parity. It can also be extended to other machine learning tasks, such as regression.
We want to thank Hrayr Harutyunyan and Mozhdeh Gheini for their help and comments.
References

[Agarwal et al., 2018] Agarwal, A.; Beygelzimer, A.; Dudik, M.; Langford, J.; and Wallach, H. 2018. A reductions approach to fair classification. In International Conference on Machine Learning, 60–69.
[Agarwal, Dudik, and Wu, 2019] Agarwal, A.; Dudik, M.; and Wu, Z. S. 2019. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, 120–129.
[Angwin et al., 2016] Angwin, J.; Larson, J.; Mattu, S.; and Kirchner, L. 2016. Machine bias. ProPublica, May 23, 2016.
[Calders and Verwer, 2010] Calders, T., and Verwer, S. 2010. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21(2).
[Chen et al., 2019] Chen, J.; Kallus, N.; Mao, X.; Svacha, G.; and Udell, M. 2019. Fairness under unawareness: Assessing disparity when protected class is unobserved. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, 339–348.
[Chouldechova and Roth, 2018] Chouldechova, A., and Roth, A. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810.
[Dwork et al., 2012] Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, 214–226. New York, NY, USA: ACM.
[Gajane and Pechenizkiy, 2018] Gajane, P., and Pechenizkiy, M. 2018. On formalizing fairness in prediction with machine learning. Fairness, Accountability, and Transparency in Machine Learning.
[Goel, Yaghini, and Faltings, 2018] Goel, N.; Yaghini, M.; and Faltings, B. 2018. Non-discriminatory machine learning through convex fairness criteria. In Thirty-Second AAAI Conference on Artificial Intelligence.
[Hardt et al., 2016] Hardt, M.; Price, E.; Srebro, N.; et al. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 3315–3323.
[Huang and Vishnoi, 2019] Huang, L., and Vishnoi, N. 2019. Stable and fair classification. In International Conference on Machine Learning, 2879–2890.
[Jagielski et al., 2019] Jagielski, M.; Kearns, M.; Mao, J.; Oprea, A.; Roth, A.; Sharifi-Malvajerdi, S.; and Ullman, J. 2019. Differentially private fair learning. In International Conference on Machine Learning, 3000–3008.
[Kamiran and Calders, 2012] Kamiran, F., and Calders, T. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems.
[Kamishima et al., 2012] Kamishima, T.; Akaho, S.; Asoh, H.; and Sakuma, J. 2012. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 35–50. Springer.
[Kleinberg, Mullainathan, and Raghavan, 2016] Kleinberg, J.; Mullainathan, S.; and Raghavan, M. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
[Krasanakis et al., 2018] Krasanakis, E.; Spyromitros-Xioufis, E.; Papadopoulos, S.; and Kompatsiaris, Y. 2018. Adaptive sensitive reweighting to mitigate bias in fairness-aware classification. In Proceedings of the 2018 World Wide Web Conference, WWW '18, 853–862.
[Kusner et al., 2017] Kusner, M. J.; Loftus, J.; Russell, C.; and Silva, R. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems 30, 4066–4076. Curran Associates, Inc.
[Mehrabi et al., 2019] Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; and Galstyan, A. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635.
[Menon and Williamson, 2018] Menon, A. K., and Williamson, R. C. 2018. The cost of fairness in binary classification. In Friedler, S. A., and Wilson, C., eds., Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, 107–118. PMLR.
[Pleiss et al., 2017] Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; and Weinberger, K. Q. 2017. On fairness and calibration. In Advances in Neural Information Processing Systems 30, 5680–5689. Curran Associates, Inc.
[Saxena et al., 2019] Saxena, N. A.; Huang, K.; DeFilippis, E.; Radanovic, G.; Parkes, D. C.; and Liu, Y. 2019. How do fairness definitions fare?: Examining public attitudes towards algorithmic definitions of fairness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 99–106. ACM.
[Schement, 2001] Schement, J. R. 2001. Imagining fairness: Equality and equity of access in search of democracy. In Libraries and Democracy: The Cornerstones of Liberty, ed. Nancy Kranich. Chicago: American Library Association.
[Ustun, Liu, and Parkes, 2019] Ustun, B.; Liu, Y.; and Parkes, D. 2019. Fairness without harm: Decoupled classifiers with preference guarantees. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 6373–6382.
[Verma and Rubin, 2018] Verma, S., and Rubin, J. 2018. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), 1–7. IEEE.
[Woodworth et al., 2017] Woodworth, B.; Gunasekar, S.; Ohannessian, M. I.; and Srebro, N. 2017. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081.
[Wu, Zhang, and Wu, 2018] Wu, Y.; Zhang, L.; and Wu, X. 2018. Fairness-aware classification: Criterion, convexity, and bounds.
[Zafar et al., 2015] Zafar, M. B.; Valera, I.; Rodriguez, M. G.; and Gummadi, K. P. 2015. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259.

Appendixes

In this section we report some additional and detailed numbers referenced in the main paper, such as the detailed averaged values shown in Figures 2 and 3 for the 10 conducted experiments on different splits of the data, along with the corresponding standard deviations in parentheses. As also mentioned in the main text, due to the existing variance over different random splits of the dataset, we found reporting p-values with the Mann-Whitney U test more suitable; however, here we also report detailed standard deviations for the sake of completeness. We also include details of our model architecture and the Mechanical Turk survey conducted and discussed in the main paper.

Layer Type   Parameters
dense        256 hidden dimensions, tanh activation
dense        2 output dimensions
Table 6: Architecture of the model used in our experiments.

[Tables 7 and 8: Detailed averaged accuracy and fairness gain values, with standard deviations, for the Equity, Parity, and Classifier losses over different β values on the COMPAS and Adult datasets.]

[Additional appendix tables: Mann-Whitney U p-values showing the significance of the results reported in Figure 3 for β values of 0.1 and 0.9, and detailed averaged bias values, with standard deviations, for the COMPAS and Adult datasets for β values of 0.1, 0.5, and 0.9.]

Survey Instructions (Click to expand)
In this task, you will be given 4 different scenarios, and we would ask you to rate how fair the proposed solutions to each of the scenarios would be on a scale of 0 to 4 (0 meaning completely unfair and 4 meaning completely fair). We would also ask you to pick one of the solutions and tell us why you picked your preferred solution.
Attention: You should provide a justification in textboxes 1, 2, 3, and 4 or you would not be paid. In other words, you should tell us why you chose your preferred picture/solution for each of the scenarios.

Scenario 1: ---- Please rate the following two pictures according to their fairness degree.
Picture A (Picture on the left): Picture B (Picture on the right):
Which picture would you prefer the most?
I would prefer Picture A (Picture on the left). I would prefer Picture B (Picture on the right).
Please tell us why you chose your preferred picture. (Must be filled in order for you to get paid)

Scenario 2: ---- 3 students with identical qualifications apply for a student loan to cover their $10k (each) tuition for a semester. The school can only offer $21k in total loans. The circumstance of each student is as follows: Student A has previously received a $5k scholarship and applies for a loan to cover the rest of his/her $5k tuition. Student B has previously received a $4k scholarship and applies for a loan to cover the rest of his/her $6k tuition. Student C has received no scholarships and applies for the loan to cover her/his $10k tuition. How should the school allocate the loans to each of the students? Please rate the following solutions according to their fairness degree.
Solution 1:
The school should allocate a $7k loan each to Student A, B, and C.
Solution 2:

The school should allocate a $5k loan to Student A, a $6k loan to Student B, and a $10k loan to Student C.
Which solution would you prefer the most for this scenario?
I would prefer solution 1. I would prefer solution 2.
Please tell us why you chose your preferred solution. (Must be filled in order for you to get paid)

Scenario 3: ---- The government is awarding 100 subsidized houses to families every year. In previous years 80 of these houses went to Caucasian families while only 20 went to people of color, although the applicant pool consisted equally of both groups. What would be the fair solution for the government to take for this year's plan? Please rate the following solutions according to their fairness degree.
Solution 1:
The government should award 50 to people of color and 50 to the Caucasians.
Solution 2:

80 houses should go to people of color, and 20 should go to Caucasians, to compensate for previous years.
Which solution would you prefer the most for this scenario?
I would prefer solution 1. I would prefer solution 2.
Please tell us why you chose your preferred picture. (Must be filled in order for you to get paid)

Scenario 4: ---- A college has 4 openings left for their undergraduate admissions. 10 people with identical qualifications and the following backgrounds apply to the college: 5 out of 10 of these applicants are going to be first generation college students if they go to college, while the other 5 applicants are non-first generation college students. Please rate the following solutions according to their fairness degree.
Solution 1:
The college should grant 2 admissions to 2 of the first generation college students and 2 to 2 of the non-first generation college students.
Solution 2:

The college should grant admission to 3 of the first generation college students and 1 admission to a non-first generation college student.
Which solution would you prefer the most for this scenario?
I would prefer solution 1. I would prefer solution 2.
Please tell us why you chose your preferred picture. (Must be filled in order for you to get paid) Scenario 5: ---- How fair it would be if we do not pay you for your effort on filling up this survey?
We have taken measures to prevent cheating and if you do not complete the task honestly we will know and the HIT will be rejected. (Optional)