Optimized Data Pre-Processing for Discrimination Prevention

Flavio P. Calmon, Dennis Wei, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney
Data Science Department, IBM Thomas J. Watson Research Center
Contact: {fdcalmon, dwei, knatesa, krvarshn}@us.ibm.com

Abstract
Non-discrimination is a recognized objective in algorithmic decision making. In this paper, we introduce a novel probabilistic formulation of data pre-processing for reducing discrimination. We propose a convex optimization for learning a data transformation with three goals: controlling discrimination, limiting distortion in individual data samples, and preserving utility. We characterize the impact of limited sample size in accomplishing this objective, and apply two instances of the proposed optimization to datasets, including one on real-world criminal recidivism. The results demonstrate that all three criteria can be simultaneously achieved and also reveal interesting patterns of bias in American society.
Discrimination is the prejudicial treatment of an individual based on membership in a legally protected group such as a race or gender. Direct discrimination occurs when protected attributes are used explicitly in making decisions, which is referred to as disparate treatment in law. More pervasive nowadays is indirect discrimination, in which protected attributes are not used but reliance on variables correlated with them leads to significantly different outcomes for different groups. The latter phenomenon is termed disparate impact. Indirect discrimination may be intentional, as in the historical practice of "redlining" in the U.S., in which home mortgages were denied in zip codes populated primarily by minorities. However, the doctrine of disparate impact applies in many situations regardless of actual intent.

Supervised learning algorithms, increasingly used for decision making in applications of consequence, may at first be presumed to be fair and devoid of inherent bias, but in fact inherit any bias or discrimination present in the data on which they are trained (Calders & Žliobaitė, 2013). Furthermore, simply removing protected variables from the data is not enough since it does nothing to address indirect discrimination and may in fact conceal it. The need for more sophisticated tools has made discrimination discovery and prevention an important research area (Pedreschi et al., 2008).

Algorithmic discrimination prevention involves modifying one or more of the following to ensure that decisions made by supervised learning methods are less biased: (a) the training data, (b) the learning algorithm, and (c) the ensuing decisions themselves. These are respectively classified as pre-processing (Hajian, 2013), in-processing (Fish et al., 2016; Zafar et al., 2016; Kamishima et al., 2011), and post-processing approaches (Hardt et al., 2016). In this paper, we focus on pre-processing since it is the most flexible in terms of the data science pipeline: it is independent of the modeling algorithm and can be integrated with data release and publishing mechanisms.

Researchers have also studied several notions of discrimination and fairness. Disparate impact is addressed by the principles of statistical parity and group fairness (Feldman et al., 2015), which seek similar outcomes for all groups. In contrast, individual fairness (Dwork et al., 2012) mandates that similar individuals be treated similarly irrespective of group membership. For classifiers and other predictive models, equal error rates for different groups are a desirable property (Hardt et al., 2016), as is calibration or lack of predictive bias in the predictions (Zhang & Neill, 2016). The tension between the last two notions is described by Kleinberg et al. (2017) and Chouldechova (2016); the work of Friedler et al. (2016) is in a similar vein. Corbett-Davies et al. (2017) discuss the cost of satisfying prevailing notions of algorithmic fairness from a public safety standpoint and discuss the trade-offs.

Figure 1: The proposed pipeline for predictive learning with discrimination prevention. Learn mode applies with training data and apply mode with novel test data. Note that test data also requires transformation before predictions can be obtained.
Since the present work pertains to pre-processing and not modeling, balanced error rates and predictive bias are less relevant criteria. Instead we focus primarily on achieving group fairness while also accounting for individual fairness through a distortion constraint.

Existing pre-processing approaches include sampling or re-weighting the data to neutralize discriminatory effects (Kamiran & Calders, 2012), changing the individual data records (Hajian & Domingo-Ferrer, 2013), and using t-closeness (Li et al., 2007) for discrimination control (Ruggieri, 2014). A common theme is the importance of balancing discrimination control against utility of the processed data. However, this prior work neither presents general and principled optimization frameworks for trading off these two criteria, nor allows connections to be made to the broader statistical learning and information theory literature via probabilistic descriptions. Another shortcoming is that individual distortion or fairness is not made explicit.

In this work, addressing gaps in the pre-processing literature, we introduce a probabilistic framework for discrimination-preventing pre-processing in supervised learning. Our aim in part is to work toward a more unified view of previously proposed concepts and methods, which may help to suggest refinements. We formulate the determination of a pre-processing transformation as an optimization problem that trades off discrimination control, data utility, and individual distortion. (Trade-offs among various fairness notions may be inherent, as shown by Kleinberg et al. (2017).) While discrimination and utility are defined at the level of probability distributions, distortion is controlled on a per-sample basis, thereby limiting the effect of the transformation on individuals and ensuring a degree of individual fairness. Figure 1 illustrates the supervised learning pipeline that includes our proposed discrimination-preventing pre-processing.

The work of Zemel et al. (2013) is closest to ours in also presenting a framework with three criteria related to discrimination control (group fairness), individual fairness, and utility. However, the criteria are manifested less directly than in our proposal. In particular, discrimination control is posed in terms of intermediate features rather than outcomes, individual distortion does not take outcomes into account (simply being a norm between original and transformed features), and utility is specific to a particular classifier. Our formulation more naturally and generally encodes these fairness and utility desiderata.

Given the novelty of our formulation, we devote more effort than usual to discussing its motivations and potential variations. We state natural conditions under which the proposed optimization problem is convex. The resulting transformation is in general a randomized one. The proposed optimization problem assumes as input an estimate of the distribution of the data which, in practice, can be imprecise due to limited sample size. Accordingly, we characterize the possible degradation in discrimination and utility guarantees at test time in terms of the training sample size. As a demonstration of our framework, we apply specific instances of it to a prison recidivism risk score dataset (ProPublica, 2017) and the UCI adult dataset (Lichman, 2013).
By solving the optimization problem, we show that discrimination, distortion, and utility loss can be controlled simultaneously with real data. In addition, the resulting transformations reveal intriguing demographic patterns in the data.

We are given a dataset consisting of n i.i.d. samples {(D_i, X_i, Y_i)}_{i=1}^n from a joint distribution p_{D,X,Y} with domain D × X × Y. Here D denotes one or more discriminatory variables such as gender and race, X denotes other non-protected variables used for decision making, and Y is an outcome random variable. For instance, Y_i could represent a loan approval decision for individual i based on demographic information D_i and credit score X_i. We focus in this paper on discrete (or discretized) and finite domains D and X and binary outcomes, i.e. Y = {0, 1}. There is no restriction on the dimensions of D and X.

Our goal is to determine a randomized mapping p_{X̂,Ŷ|X,Y,D} that (i) transforms the given dataset into a new dataset {(D_i, X̂_i, Ŷ_i)}_{i=1}^n, which may be used to train a model, and (ii) similarly transforms data to which the model is applied, i.e. test data. Each (X̂_i, Ŷ_i) is drawn independently from the same domain X × Y as X, Y by applying p_{X̂,Ŷ|X,Y,D} to the corresponding triplet (D_i, X_i, Y_i). Since D_i is retained as-is, we do not include it in the mapping to be determined. Motivation for retaining D is discussed later in Section 3.2. For test samples, Y_i is not available at the input while Ŷ_i may not be needed at the output. In this case, a reduced mapping p_{X̂|X,D} may be used, which can be obtained from p_{X̂,Ŷ|X,Y,D} by marginalizing over Ŷ and Y after weighting by p_{Y|X,D}.

It is assumed that p_{D,X,Y} is known along with its marginals and conditionals. This assumption is often satisfied using the empirical distribution of {(D_i, X_i, Y_i)}_{i=1}^n. In Section 3.2, we state a result ensuring that discrimination and utility loss continue to be controlled if the distribution used to determine p_{X̂,Ŷ|X,Y,D} differs from the distribution of test samples.

We propose that the mapping p_{X̂,Ŷ|X,Y,D} satisfy the properties discussed in the following three subsections.
The first objective is to limit the dependence of the transformed outcome Ŷ on the discriminatory variables D, as represented by the conditional distribution p_{Ŷ|D}. We propose two alternative formulations. The first requires p_{Ŷ|D} to be close to a target distribution p_{Y_T} for all values of D,

    J(p_{Ŷ|D}(y|d), p_{Y_T}(y)) ≤ ε_{y,d}   ∀ d ∈ D, y ∈ {0, 1},   (1)

where J(·,·) denotes some distance function. The second formulation constrains p_{Ŷ|D} to be similar for any two values of D,

    J(p_{Ŷ|D}(y|d₁), p_{Ŷ|D}(y|d₂)) ≤ ε_{y,d₁,d₂}   (2)

for all d₁, d₂ ∈ D, y ∈ {0, 1}. The latter (2) does not require a target distribution as reference but does increase the number of constraints from O(|D|) to O(|D|²).

The choice of target p_{Y_T} in (1), and distance J and thresholds ε in (1) and (2), should be informed by societal considerations. If the application domain has a clear legal definition of disparate impact, for example the "80% rule" (EEOC, 1979), then it can be translated into a mathematical constraint. Otherwise and more generally, the instantiation of (1) should involve consultation with domain experts and stakeholders before being put into practice.

For this work, we choose J to be the following probability ratio measure:

    J(p, q) = | p/q − 1 |.   (3)

The combination of (3) and (1) generalizes the extended lift criterion proposed in the literature (Pedreschi et al., 2012), while the combination of (3) and (2) generalizes selective and contrastive lift. In the numerical results in Section 4, we use both (1) and (2). For (1), we make the straightforward choice of setting p_{Y_T} = p_Y, the original marginal distribution of the outcome variable. We recognize however that this choice of target may run the risk of perpetuating bias in the original dataset. On the other hand, how to choose a target distribution that is "fairer" than p_Y is largely an open question; we refer the reader to Žliobaitė et al. (2011) for one such proposal, which is reminiscent of the concept of "balanced error rate" in classification (Zhao et al., 2013).

In (1) and (2), discrimination control is imposed jointly with respect to all discriminatory variables, e.g. all combinations of gender and race if D consists of those two variables. An alternative is to take the discriminatory variables one at a time, e.g. gender without regard to race and vice-versa. The latter, which we refer to as univariate discrimination control, can be formulated similarly to (1), (2). In this work, we opt for joint discrimination control as it is more stringent than univariate. We note however that legal formulations tend to be of the univariate type.

Formulations (1) and (2) control discrimination at the level of the overall population in the dataset. It is also possible to control discrimination within segments of the population by conditioning on additional variables B, where B is a subset of X and X is a collection of features. Constraint (1) would then generalize to

    J(p_{Ŷ|D,B}(y|d,b), p_{Y_T|B}(y|b)) ≤ ε_{y,d,b}   (4)

for all d ∈ D, y ∈ {0, 1}, and b ∈ B. Similar conditioning or "context" for discrimination has been explored before in Hajian & Domingo-Ferrer (2013) in the setting of association rule mining. As one example, B may consist of non-discriminatory variables that are strongly correlated with the outcome Y, e.g. education level as it relates to income. One may wish to control for such variables in determining whether discrimination is present and needs to be corrected. At the same time, care must be taken so that the population segments created by conditioning on B are large enough for statistically valid inferences to be made. For present purposes, we simply note that conditional discrimination constraints (4) can be accommodated in our framework and defer further investigation to future work.
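To make the ratio measure concrete, the short sketch below checks constraint (1) for each group against a target distribution. The numbers and group labels are illustrative assumptions, not values from the paper.

```python
# A small illustration of the ratio measure (3) used to audit constraint (1):
# compare each group's outcome distribution against the target p_{Y_T}.
import numpy as np

def J(p, q):
    """Probability ratio measure J(p, q) = |p/q - 1| from (3)."""
    return abs(p / q - 1.0)

p_y_target = np.array([0.6, 0.4])                      # target p_{Y_T}(y), y in {0, 1}
p_y_given_d = {"group_a": np.array([0.55, 0.45]),      # empirical p_{Y|D}(y|d), hypothetical
               "group_b": np.array([0.70, 0.30])}

eps = 0.1
for d, p in p_y_given_d.items():
    violations = [y for y in (0, 1) if J(p[y], p_y_target[y]) > eps]
    print(d, "violates (1) for y =", violations)
```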
The mapping p_{X̂,Ŷ|X,Y,D} should satisfy distortion constraints with respect to the domain X × Y. These constraints restrict the mapping to reduce or avoid altogether certain large changes (e.g. a very low credit score being mapped to a very high credit score). Given a distortion metric δ : (X × Y) × (X × Y) → R₊, we constrain the conditional expectation of the distortion as follows:

    E[ δ((x,y), (X̂,Ŷ)) | D = d, X = x, Y = y ] ≤ c_{d,x,y}   ∀ (d,x,y) ∈ D × X × Y.   (5)

We assume that δ((x,y), (x,y)) = 0 for all (x,y) ∈ X × Y.

Constraint (5) is formulated with pointwise conditioning on (D,X,Y) = (d,x,y) in order to promote individual fairness. It ensures that distortion is controlled for every combination of (d,x,y), i.e. every individual in the original dataset, and more importantly, every individual to which a model is later applied. By way of contrast, an average-case measure in which an expectation is also taken over D, X, Y may result in high distortion for certain (d,x,y), likely those with low probability. Equation (5) also allows the level of control c_{d,x,y} to depend on (d,x,y) if desired. We also note that (5) is a property of the mapping p_{X̂,Ŷ|D,X,Y}, and does not depend on the assumed distribution p_{D,X,Y}.

The expectation over X̂, Ŷ in (5) encompasses several cases depending on the choices of the metric δ and thresholds c_{d,x,y}. If c_{d,x,y} = 0, then no mappings with nonzero distortion are allowed for individuals with original values (d,x,y). If c_{d,x,y} > 0, then certain mappings may still be disallowed by assigning them infinite distortion. Mappings with finite distortion are permissible subject to the budget c_{d,x,y}. Lastly, if δ is binary-valued (perhaps achieved by thresholding a multi-valued distortion function), it can be seen as classifying mappings into desirable (δ = 0) and undesirable ones (δ = 1). Here, (5) reduces to a bound on the conditional probability of an undesirable mapping, i.e.

    Pr( δ((x,y), (X̂,Ŷ)) = 1 | D = d, X = x, Y = y ) ≤ c_{d,x,y}.   (6)

In addition to constraints on individual distortions, we also require that the distribution of (X̂,Ŷ) be statistically close to the distribution of (X,Y). This is to ensure that a model learned from the transformed dataset (when averaged over the discriminatory variables D) is not too different from one learned from the original dataset, e.g. a bank's existing policy for approving loans. For a given dissimilarity measure ∆ between probability distributions (e.g. KL-divergence), we require that ∆(p_{X̂,Ŷ}, p_{X,Y}) be small.
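Before assembling the full problem, the following toy sketch illustrates two of these ingredients numerically: the thresholded, binary-valued distortion bound (6) for a single original value (d, x, y), and an ℓ₁-type utility dissimilarity ∆. All numbers are made up purely for illustration.

```python
# Toy sketch of the binarized distortion bound (6) and an l1-style utility loss Delta.
import numpy as np

delta = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])            # toy distortion metric over 3 (x, y) cells
undesirable = (delta > 2.0).astype(float)      # binary-valued metric as in (6)

row = np.array([0.8, 0.15, 0.05])              # p_{Xhat,Yhat | d, x, y} for original cell 0
print("Pr(undesirable | d, x, y):", row @ undesirable[0])   # must stay below c_{d,x,y}

p_xy = np.array([0.5, 0.3, 0.2])               # original p_{X,Y}
p_xy_hat = np.array([0.45, 0.35, 0.2])         # transformed marginal p_{Xhat,Yhat}
print("utility loss (l1):", np.abs(p_xy - p_xy_hat).sum())
```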
Putting together the considerations from the three previous subsections, we arrive at the optimization problem below for determining a randomized transformation p_{X̂,Ŷ|X,Y,D} mapping each sample (D_i, X_i, Y_i) to (X̂_i, Ŷ_i):

    min over p_{X̂,Ŷ|X,Y,D} :  ∆(p_{X̂,Ŷ}, p_{X,Y})
    s.t.  J(p_{Ŷ|D}(y|d), p_{Y_T}(y)) ≤ ε_{y,d}  and
          E[ δ((x,y), (X̂,Ŷ)) | D = d, X = x, Y = y ] ≤ c_{d,x,y}   ∀ (d,x,y) ∈ D × X × Y,
          p_{X̂,Ŷ|X,Y,D} is a valid distribution.   (7)

We choose to minimize the utility loss ∆ subject to constraints on individual distortion (5) and discrimination, where we have used (1) for concreteness, since it is more natural to place bounds on the latter two.

The distortion constraints (5) are an essential component of the problem formulation (7). Without (5), and assuming that p_{Y_T} = p_Y, it is possible to achieve perfect utility and non-discrimination simply by sampling (X̂_i, Ŷ_i) from the original distribution p_{X,Y} independently of any inputs, i.e. p_{X̂,Ŷ|X,Y,D}(x̂,ŷ|x,y,d) = p_{X̂,Ŷ}(x̂,ŷ) = p_{X,Y}(x̂,ŷ). Then ∆(p_{X̂,Ŷ}, p_{X,Y}) = 0, and p_{Ŷ|D}(y|d) = p_{Ŷ}(y) = p_Y(y) = p_{Y_T}(y) for all d ∈ D. This solution however is clearly objectionable from the viewpoint of individual fairness, especially for individuals to whom a subsequent model is applied, since it amounts to discarding an individual's data and replacing it with a random sample from the population p_{X,Y}. Constraint (5) seeks to prevent such gross deviations from occurring.
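To make the structure of (7) concrete, here is a minimal sketch (not the authors' released code) of how the problem can be posed with CVXPY (Diamond & Boyd, 2016) on a toy discrete domain, using the ratio measure (3), which turns the discrimination constraint into a linear one. The dimensions, the random distortion metric, and the budgets are assumptions made purely for illustration.

```python
# Sketch of problem (7) in CVXPY on a toy domain with |D| = 2 groups and 6 (x, y) cells.
import numpy as np
import cvxpy as cp

nD, nXY = 2, 6
rng = np.random.default_rng(0)

pDXY = rng.random((nD, nXY)); pDXY /= pDXY.sum()     # empirical p_{D,X,Y}, rows indexed by d
pD = pDXY.sum(axis=1)                                # p_D(d)
pXY = pDXY.sum(axis=0)                               # p_{X,Y}(x, y)
y_of = np.tile([0, 1], nXY // 2)                     # outcome label of each (x, y) cell
yT = np.array([pXY[y_of == 0].sum(), pXY[y_of == 1].sum()])   # target p_{Y_T} = p_Y

delta = rng.random((nXY, nXY)); np.fill_diagonal(delta, 0.0)  # toy distortion metric
eps, c = 0.2, 0.5                                    # discrimination / distortion budgets

# Variable: one row-stochastic mapping p_{Xhat,Yhat | X,Y, D=d} per group d.
P = [cp.Variable((nXY, nXY), nonneg=True) for _ in range(nD)]
constraints = [cp.sum(Pd, axis=1) == 1 for Pd in P]

# Utility: transformed marginal and KL(p_{X,Y} || p_{Xhat,Yhat}) (both marginals sum to 1).
pXY_hat = sum(pDXY[d] @ P[d] for d in range(nD))
objective = cp.sum(cp.kl_div(pXY, pXY_hat))

for d in range(nD):
    pXY_given_d = pDXY[d] / pD[d]
    for y in (0, 1):
        # Discrimination control (1) with (3): |p_{Yhat|D}(y|d) - p_{Y_T}(y)| <= eps * p_{Y_T}(y).
        mask = (y_of == y).astype(float)
        pYhat_d = cp.sum(cp.multiply(pXY_given_d @ P[d], mask))
        constraints.append(cp.abs(pYhat_d - yT[y]) <= eps * yT[y])
    # Individual distortion control (5): expected distortion per original cell <= c.
    constraints.append(cp.sum(cp.multiply(P[d], delta), axis=1) <= c)

prob = cp.Problem(cp.Minimize(objective), constraints)
prob.solve(solver=cp.SCS)                            # SCS handles the exponential cone of kl_div
print("optimal utility loss (KL):", prob.value)
```

Because p_{X̂,Ŷ} and p_{Ŷ|D} are linear in the mapping, every constraint above is convex, in line with the discussion in the next subsection.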
We first discuss conditions under which (7) is a convex or quasiconvex optimization problem. Considering first the objective function, the distribution p_{X,Y} is a given quantity while

    p_{X̂,Ŷ}(x̂,ŷ) = Σ_{d,x,y} p_{D,X,Y}(d,x,y) p_{X̂,Ŷ|D,X,Y}(x̂,ŷ|d,x,y)

is seen to be a linear function of the mapping p_{X̂,Ŷ|D,X,Y}, i.e. the optimization variable. Hence if the statistical dissimilarity ∆(·,·) is convex in its first argument with the second fixed, then ∆(p_{X̂,Ŷ}, p_{X,Y}) is a convex function of p_{X̂,Ŷ|D,X,Y} by the affine composition property (Boyd & Vandenberghe, 2004). This condition is satisfied for example by all f-divergences (Csiszár & Shields, 2004), which are jointly convex in both arguments, and by all Bregman divergences (Banerjee et al., 2005). If instead ∆(·,·) is only quasiconvex in its first argument, a similar composition property implies that ∆(p_{X̂,Ŷ}, p_{X,Y}) is a quasiconvex function of p_{X̂,Ŷ|D,X,Y} (Boyd & Vandenberghe, 2004).

For discrimination constraint (1), the target distribution p_{Y_T} is assumed to be given. The conditional distribution p_{Ŷ|D} can be related to p_{X̂,Ŷ|D,X,Y} as follows:

    p_{Ŷ|D}(ŷ|d) = Σ_{x̂} Σ_{x,y} p_{X,Y|D}(x,y|d) p_{X̂,Ŷ|D,X,Y}(x̂,ŷ|d,x,y).

Since p_{X,Y|D} is given, p_{Ŷ|D} is a linear function of p_{X̂,Ŷ|D,X,Y}. Hence by the same composition property as above, (1) is a convex constraint, i.e. specifies a convex set, if the distance function J(·,·) is quasiconvex in its first argument.

If constraint (2) is used instead of (1), then both arguments of J are linear functions of p_{X̂,Ŷ|D,X,Y}. Hence (2) is convex if J is jointly quasiconvex in both arguments.

Lastly, the distortion constraint (5) can be expanded explicitly in terms of p_{X̂,Ŷ|D,X,Y} to yield

    Σ_{x̂,ŷ} p_{X̂,Ŷ|D,X,Y}(x̂,ŷ|d,x,y) δ((x,y), (x̂,ŷ)) ≤ c_{d,x,y}.

Thus (5) is a linear constraint in p_{X̂,Ŷ|D,X,Y} regardless of the choice of distortion metric δ.

We summarize this subsection with the following proposition.

Proposition 1. Problem (7) is a (quasi)convex optimization if ∆(·,·) is (quasi)convex and J(·,·) is quasiconvex in their respective first arguments (with the second arguments fixed). If discrimination constraint (2) is used in place of (1), then the condition on J is that it be jointly quasiconvex in both arguments.

We now discuss the generalizability of discrimination guarantees (1) and (2) to unseen individuals, i.e. those to whom a model is applied. Recall from Section 2 that the proposed transformation retains the discriminatory variables D. We first consider the case where models trained on the transformed data to predict Ŷ are allowed to depend on D. While such models may qualify as disparate treatment, the intent and effect is to better mitigate disparate impact resulting from the model. In this respect our proposal shares the same spirit with "fair" affirmative action in Dwork et al. (2012) (fairer on account of distortion constraint (5)). Later in this subsection we consider the case where D is suppressed at classification time.

Assuming that predictive models for Ŷ can depend on D, let Ỹ be the output of such a model based on D and X̂. To remove the separate issue of model accuracy, suppose for simplicity that the model provides a good approximation to the conditional distribution of Ŷ: p_{Ỹ|X̂,D}(ỹ|x̂,d) ≈ p_{Ŷ|X̂,D}(ỹ|x̂,d).
Then for individuals in a protected group D = d, the conditional distribution of Ỹ is given by

    p_{Ỹ|D}(ỹ|d) = Σ_{x̂} p_{Ỹ|X̂,D}(ỹ|x̂,d) p_{X̂|D}(x̂|d) ≈ Σ_{x̂} p_{Ŷ|X̂,D}(ỹ|x̂,d) p_{X̂|D}(x̂|d) = p_{Ŷ|D}(ỹ|d).   (8)

Hence the model output p_{Ỹ|D} can also be controlled by (1) or (2).

On the other hand, if D must be suppressed from the transformed data, perhaps to comply with legal requirements regarding its non-use, then a predictive model can depend only on X̂ and approximate p_{Ŷ|X̂}, i.e. p_{Ỹ|X̂,D}(ỹ|x̂,d) = p_{Ỹ|X̂}(ỹ|x̂) ≈ p_{Ŷ|X̂}(ỹ|x̂). In this case we have

    p_{Ỹ|D}(ỹ|d) ≈ Σ_{x̂} p_{Ŷ|X̂}(ỹ|x̂) p_{X̂|D}(x̂|d),   (9)

which in general is not equal to p_{Ŷ|D}(ỹ|d) in (8). The quantity on the right-hand side of (9) is less straightforward to control. We address this issue in the next subsection.

In many applications the discriminatory variable cannot be revealed to the classification algorithm. In this case, the train-time discrimination guarantees are preserved at apply time if the Markov relationship D → X̂ → Ŷ (i.e. p_{Ŷ|X̂,D} = p_{Ŷ|X̂}) holds since, in this case,

    p_{Ỹ|D}(ỹ|d) ≈ Σ_{x̂} p_{Ŷ|X̂}(ỹ|x̂) p_{X̂|D}(x̂|d) = p_{Ŷ|D}(ỹ|d).   (10)

Thus, given that the distribution p_{D,X,Y} is known, the guarantees provided during training still hold when applied to fresh samples if the additional constraint p_{X̂,Ŷ|D,X,Y} = p_{Ŷ|X̂} p_{X̂|D,X,Y} is satisfied. We refer to (7) with the additional constraint p_{X̂,Ŷ|D,X,Y} = p_{Ŷ|X̂} p_{X̂|D,X,Y} as the suppressed optimization formulation (SOF). Alas, since the added constraint is non-convex, the SOF is not a convex program, despite being convex in p_{X̂|D,X,Y} for a fixed p_{Ŷ|X̂} and vice-versa (i.e. it is biconvex). We propose next two strategies for addressing this problem.

1. The first approach is to restrict p_{Ŷ|X̂} = p_{Y|X} and solve (7) for p_{X̂|D,X,Y}. If ∆(·,·) is an f-divergence, then

    ∆(p_{X,Y}, p_{X̂,Ŷ}) = D_f(p_{X,Y} ‖ p_{X̂,Ŷ}) = Σ_{x,y} p_{X̂,Ŷ}(x,y) f( p_{X,Y}(x,y) / p_{X̂,Ŷ}(x,y) )
        ≥ Σ_x p_{X̂}(x) f( Σ_y p_{Ŷ|X̂}(y|x) p_{X,Y}(x,y) / p_{X̂,Ŷ}(x,y) ) = D_f(p_X ‖ p_{X̂}),

where the inequality follows from convexity of f. Since the last quantity is achieved by setting p_{Ŷ|X̂} = p_{Y|X}, this choice is optimal in terms of the objective function. It may, however, render the constraints in (7) infeasible. Assuming feasibility is maintained, this approach has the added benefit that a classifier f_θ(x) ≈ p_{Y|X}(·|x) can be trained using the original (non-perturbed) data, and maintained for classification during apply time.

2. Alternatively, a solution can be found through alternating minimization: fix p_{Ŷ|X̂} and solve the SOF for p_{X̂|D,X,Y}, and then fix p_{X̂|D,X,Y} as the optimal solution and solve the SOF for p_{Ŷ|X̂}. The resulting sequence of values of the objective function is non-increasing, but may converge to a local minimum.

There is a close relationship between estimation and discrimination.
If the discriminatory variable D can be reliably estimated from the outcome variable Y, then it is reasonable to expect that the discrimination control constraint (1) does not hold for small values of ε_{y,d}. We make this intuition precise in the next proposition when J is given in (3).

More specifically, we prove that if the advantage of estimating D from Y over a random guess is large, then there must exist a value of d and y such that J(p_{Y|D}(y|d), p_{Y_T}(y)) is also large. Thus, standard estimation methods can be used to detect the presence of discrimination: if an estimation algorithm can estimate D from Y, then discrimination may be present. Alternatively, if discrimination control is successful, then no estimator can significantly improve upon a random guess when estimating D from Y.

We denote the highest probability of correctly guessing D from an observation of Y by P_c(D|Y), where

    P_c(D|Y) ≜ max_{D→Y→D̂} Pr(D = D̂),   (11)

and the maximum is taken across all estimators p_{D̂|Y} that satisfy the Markov condition D → Y → D̂. For D and Y defined over finite supports, this is achieved by the maximum a posteriori (MAP) estimator and, consequently,

    P_c(D|Y) = Σ_{y∈Y} p_Y(y) max_{d∈D} p_{D|Y}(d|y).   (12)

Let p*_D be the probability of the most likely outcome of D, i.e. p*_D ≜ max_{d∈D} p_D(d). The (multiplicative) advantage over a random guess is given by

    Adv(D|Y) ≜ P_c(D|Y) / p*_D.   (13)
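The quantities (12) and (13) are straightforward to compute from an empirical joint distribution of D and Y; the following sketch, with made-up numbers, illustrates the resulting discrimination check.

```python
# Sketch of the MAP-based check (11)-(13): how well can D be guessed from Y alone?
import numpy as np

pDY = np.array([[0.35, 0.15],     # rows: groups d, columns: outcomes y in {0, 1}
                [0.20, 0.30]])    # toy joint distribution p_{D,Y}
pD = pDY.sum(axis=1)              # p_D(d)

Pc = pDY.max(axis=0).sum()        # (12): sum_y max_d p_{D,Y}(d, y)
adv = Pc / pD.max()               # (13): advantage over guessing the most likely group
print(Pc, adv)                    # if adv > 1 + eps, Proposition 2 below says (1) is violated
```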
Proposition 2. For D and Y defined over finite support sets, if

    Adv(D|Y) > 1 + ε,   (14)

then for any p_{Y_T}, there exist y ∈ Y and d ∈ D such that

    | p_{Y|D}(y|d) / p_{Y_T}(y) − 1 | > ε.   (15)
Proof. We prove the contrapositive of the statement of the proposition. Assume that

    | p_{Y|D}(y|d) / p_{Y_T}(y) − 1 | ≤ ε   ∀ y ∈ Y, d ∈ D.   (16)

Then

    P_c(D|Y) = Σ_{y∈Y} max_{d∈D} p_{D|Y}(d|y) p_Y(y)
             = Σ_{y∈Y} max_{d∈D} p_{Y|D}(y|d) p_D(d)
             ≤ Σ_{y∈Y} max_{d∈D} (1 + ε) p_{Y_T}(y) p_D(d)
             = (1 + ε) max_{d∈D} p_D(d),

where the inequality follows by noting that (16) implies p_{Y|D}(y|d) ≤ (1 + ε) p_{Y_T}(y) for all y ∈ Y, d ∈ D. Rearranging terms, we arrive at

    P_c(D|Y) / max_{d∈D} p_D(d) ≤ 1 + ε,

and the result follows by observing that the left-hand side is the definition of Adv(D|Y).

The proposed optimization framework has two modes of operation (Fig. 1): train and apply. In train mode, the optimization problem (7) is solved in order to determine a mapping p_{X̂,Ŷ|X,Y,D} for randomizing the training set. The randomized training set, in turn, is used to fit a classification model f_θ(X̂, D) that approximates p_{Ŷ|X̂,D}, where θ are the parameters of the model.
At apply time, a new data point (X, D) is received and transformed into (X̂, D) through a randomized mapping p_{X̂|X,D}. The mapping p_{X̂|D,X} is given by marginalizing over Y, Ŷ:

    p_{X̂|D,X}(x̂|d,x) = Σ_{y,ŷ} p_{X̂,Ŷ|X,Y,D}(x̂,ŷ|x,y,d) p_{Y|X,D}(y|x,d).   (17)

Assuming that the variable D is not suppressed, and that the marginals are known, the utility and discrimination guarantees set during train time still hold during apply time, as discussed in Section 3.2. However, the distortion control will inevitably change, since the mapping has been marginalized over Y. More specifically, the bound on the expected distortion for each sample becomes

    E[ E[ δ((x,Y), (X̂,Ŷ)) | D = d, X = x, Y ] | D = d, X = x ] ≤ Σ_{y∈Y} p_{Y|X,D}(y|x,d) c_{x,y,d} ≜ c_{x,d}.   (18)

If the distortion control values c_{x,y,d} are independent of y, then the upper bound on distortion set during training time still holds during apply time. Otherwise, (18) provides a bound on individual distortion at apply time. The same guarantee holds for the case when D is suppressed.

We may also consider the case where the distribution p_{D,X,Y} used to determine the transformation differs from the distribution q_{D,X,Y} of test samples. This occurs, for example, when p_{D,X,Y} is the empirical distribution computed from n i.i.d. samples from an unknown distribution q_{D,X,Y}. In this situation, discrimination control and utility are still guaranteed for samples drawn from q_{D,X,Y} that are transformed using p_{Ŷ,X̂|X,Y,D}, where the latter is obtained by solving (7) with p_{D,X,Y}. In particular, denoting by q_{Ŷ|D} and q_{X̂,Ŷ} the corresponding distributions for Ŷ, X̂ and D when q_{D,X,Y} is transformed using p_{Ŷ,X̂|X,Y,D}, we have J(p_{Ŷ|D}(y|d), p_{Y_T}(y)) → J(q_{Ŷ|D}(y|d), p_{Y_T}(y)) and ∆(p_{X,Y}, p_{X̂,Ŷ}) → ∆(q_{X,Y}, q_{X̂,Ŷ}) for n sufficiently large (the distortion control constraints (5) only depend on p_{Ŷ,X̂|X,Y,D}). The next proposition provides an estimate of the rate of this convergence in terms of n, assuming p_{Y,D}(y,d) is fixed and bounded away from zero. Its proof can be found in the Appendix.

Proposition 3. Let p_{D,X,Y} be the empirical distribution obtained from n i.i.d. samples that is used to determine the mapping p_{Ŷ,X̂|X,Y,D}, and q_{D,X,Y} be the true distribution of the data. In addition, denote by q_{D,X̂,Ŷ} the joint distribution after applying p_{Ŷ,X̂|X,Y,D} to samples from q_{D,X,Y}. If for all y ∈ Y, d ∈ D we have p_{Y,D}(y,d) > 0, J(p_{Ŷ|D}(y|d), p_{Y_T}(y)) ≤ ε, where J is given in (3), and

    ∆(p_{X,Y}, p_{X̂,Ŷ}) = Σ_{x,y} | p_{X,Y}(x,y) − p_{X̂,Ŷ}(x,y) | ≤ µ,

then with probability 1 − β,

    J(q_{Ŷ|D}(y|d), p_{Y_T}(y)) = ε + O( √( (1/n) log(n/β) ) ),   (19)
    ∆(q_{X,Y}, q_{X̂,Ŷ}) = µ + O( √( (1/n) log(n/β) ) ).   (20)

Proposition 3 guarantees that, as long as n is sufficiently large, the utility and discrimination control guarantees will approximately hold when p_{X̂,Ŷ|Y,X,D} is applied to fresh samples drawn from q_{D,X,Y}. In particular, the utility and discrimination guarantees will converge to the ones used as parameters in the optimization at a rate that is at least Θ( √( (1/n) log n ) ).
The distortion control guarantees (5) are a property of the mapping p_{X̂,Ŷ|Y,X,D}, and do not depend on the distribution of the data.

Observe that hidden within the big-O terms in Proposition 3 are constants that depend on the probability of the least likely symbol and the alphabet size. The exact characterization of these constants can be found in the proof of the proposition in the appendix. Moreover, the upper bounds become loose if p_{Y,D}(y,d) can be made arbitrarily small. Thus, it is necessary to assume that p_{Y,D}(y,d) is fixed and bounded away from zero. Moreover, if the dimensionality of the support sets of D, X and Y is large, and the number of samples n is limited, then a dimensionality reduction step (e.g. clustering) may be necessary in order to assure that discrimination control and utility are adequately preserved at test time. Proposition 3 and its proof can be used to provide an explicit estimate of the required reduction. Finally, we also note that if there are insufficient samples to reliably estimate q_{D,X,Y}(d,x,y) for certain values (d,x,y) ∈ D × X × Y, then, for those groups (d,x), it is statistically challenging to verify discrimination, and thus control may not be meaningful.
We apply our proposed data transformation approach to two different datasets to demonstrate its capabilities. We approximate p_{D,X,Y} using the empirical distribution of (D, X, Y) in the datasets, specialize the optimization (7) according to the needs of the application, and solve (7) using a standard convex solver (Diamond & Boyd, 2016).
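As a concrete illustration of the first step, the empirical joint distribution can be tabulated directly from a dataset. The sketch below assumes a pandas DataFrame, and the column names shown are placeholders rather than the exact fields used in our experiments.

```python
# Sketch: form the empirical joint distribution p_{D,X,Y} before instantiating (7).
import pandas as pd

def empirical_joint(df: pd.DataFrame, d_cols, x_cols, y_col):
    """Return p_{D,X,Y} as a Series indexed by every observed (d, x, y) combination."""
    cols = list(d_cols) + list(x_cols) + [y_col]
    counts = df.groupby(cols).size()
    return counts / counts.sum()

# Hypothetical usage with placeholder column names:
# pDXY = empirical_joint(df, ["race", "sex"], ["age_cat", "charge_degree", "priors"], "recid")
```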
Recidivism refers to a person's relapse into criminal behavior. It has been found that about two-thirds of prisoners in the US are re-arrested after release (Durose et al., 2014). It is important therefore to understand the recidivistic tendencies of incarcerated individuals who are considered for release at several points in the criminal justice system (bail hearings, parole, etc.). Automated risk scoring mechanisms have been developed for this purpose and are currently used in courtrooms in the US, in particular the proprietary COMPAS tool by Northpointe (Northpointe Inc.).

Recently, ProPublica published an article that investigates racial bias in the COMPAS algorithm (ProPublica, 2016), releasing an accompanying dataset that includes COMPAS risk scores, recidivism records, and other relevant attributes (ProPublica, 2017). A basic finding is that the COMPAS algorithm tends to assign higher scores to African-American individuals, a reflection of the a priori higher prevalence of recidivism in this group. The article goes on to demonstrate unequal false positive and false negative rates between African-Americans and Caucasian-Americans, which has since been shown by Chouldechova (2016) to be a necessary consequence of the calibration of the model and the difference in a priori prevalence.
Table 1: ProPublica dataset features.

Feature               Values                              Comments
Recidivism (binary)   {0, 1}
Gender                {Male, Female}
Race                  {Caucasian, African-American}       Races with small samples removed
Age category          {<25, 25-45, >45} years of age
Charge degree         {Felony, Misdemeanor}               For the current arrest
Prior counts          {0, 1-3, >3}                        Number of prior crimes

Figure 2: Objective value vs. discrimination parameter ε for a fixed expected distortion constraint c (the region of small ε is infeasible).
In this work, our interest is not in the debate surrounding the COMPAS algorithm but rather in the underlying recidivism data (ProPublica, 2017). Using the proposed data transformation approach, we demonstrate the technical feasibility of mitigating the disparate impact of recidivism records on different demographic groups while also preserving utility and individual fairness. (We make no comment on the associated societal considerations.) From ProPublica's dataset, we select severity of charge, number of prior crimes, and age category to be the decision variables (X). The outcome variable (Y) is a binary indicator of whether the individual recidivated (re-offended), and race and gender are set to be the discriminatory variables (D). The encoding of the decision and discrimination variables is described in Table 1. The dataset was processed to contain around 5k records.

Specific Form of Optimization.
We specialize our general formulation in (7) by setting the utility measure ∆(p_{X,Y}, p_{X̂,Ŷ}) to be the KL divergence D_KL(p_{X,Y} ‖ p_{X̂,Ŷ}). For discrimination control, we use (2), with J given in (3), while fixing ε_{y,d₁,d₂} = ε. For the sake of simplicity, we use the expected distortion constraint in (5) with c_{d,x,y} = c uniformly. The distortion function δ in (5) has the following form. Jumps of more than one category in age and prior counts are heavily discouraged by setting a very high distortion penalty for such transformations. We impose the same penalty on increases in recidivism (a change of Y from 0 to 1). Both these choices are made to promote individual fairness. Furthermore, every jump to the next category for age and prior counts is assessed a penalty of 1, and a similar jump incurs a penalty of 2 for charge degree. A reduction in recidivism (1 to 0) has a penalty of 2. The total distortion for each individual is the sum of squares of the distortions for each attribute of X. These distortion values were chosen for demonstration purposes to be reasonable to our judgment, and can easily be tuned according to the needs of a practitioner.
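The following sketch shows one way to code the distortion function just described; the large penalty constant is a stand-in (only its order of magnitude matters), and the integer attribute encoding and the handling of the recidivism penalty are our reading of the description above rather than the exact implementation.

```python
# Sketch of the recidivism-data distortion metric described above (assumed encoding).
BIG = 1e4  # placeholder for the "heavily discouraged" penalty

def distortion(orig, new):
    """orig/new: dicts with integer-coded 'age', 'priors', 'charge' and binary 'recid'."""
    per_attr = []
    for attr, step_cost in (("age", 1), ("priors", 1), ("charge", 2)):
        jump = abs(new[attr] - orig[attr])
        if attr != "charge" and jump > 1:
            per_attr.append(BIG)               # multi-category jumps in age/priors
        else:
            per_attr.append(step_cost * jump)  # single-category jumps
    d = sum(v ** 2 for v in per_attr)          # sum of squares over attributes of X
    if new["recid"] > orig["recid"]:
        d += BIG                               # increases in recidivism are heavily penalized
    elif new["recid"] < orig["recid"]:
        d += 2                                 # reduction in recidivism costs 2
    return d
```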
Results.
We computed the optimal objective value (i.e., KL divergence) resulting from solving (7) for different values of the discrimination control parameter ε, with the expected distortion constraint c held fixed. Below ε = 0.2, no feasible solution can be found that also satisfies the distortion constraint. Above a sufficiently large ε, discrimination control no longer binds and the objective reaches D_KL(p_{X,Y} ‖ p_{X̂,Ŷ}) = 0. In between, the optimal value varies as a smooth function of ε (Fig. 2).
Figure 3: Conditional mappings p_{X̂,Ŷ|X,Y,D} with ε = 0.1 and the same fixed c. (Left) D = (African-American, Male), less than 25 years (X), Y = 1; (middle) D = (African-American, Male), less than 25 years (X), Y = 0; and (right) D = (Caucasian, Male), less than 25 years (X), Y = 1. The original charge degree and prior counts (X) are shown on the vertical axis, while the transformed age category, charge degree, prior counts and recidivism (X̂, Ŷ) are represented along the horizontal axis. The charge degree F indicates felony and M indicates misdemeanor. Colors indicate mapping probability values. Columns are included only if the sum of their values exceeds a small threshold.
Figure 4: Top row: percentage recidivism rates in the original dataset as a function of charge degree, age and prior counts, for the overall population (i.e. p_{Y|X}(1|x)) and for different groups (p_{Y|X,D}(1|x,d)). Bottom row: change in percentages due to the transformation, i.e. p_{Ŷ|X̂,D}(1|x,d) − p_{Y|X,D}(1|x,d), etc. Values for cohorts of charge degree, age, and prior counts with fewer than 20 samples are not shown. The discrimination and distortion constraints are set to the same ε and c as in Fig. 3.

We fix the values of c and ε for the remaining results. Ideally, we would like Ŷ to be independent of D, but in practice this dependence is controlled by the discrimination parameter ε. The corresponding marginals p_{Y|D} and p_{Ŷ|D} are illustrated in Table 2, where clearly Ŷ is less dependent on D compared to Y. In particular, since an increase in recidivism is heavily penalized, the net effect of the randomized transformation is to decrease the recidivism risk of males, and particularly African-American males.
Table 2: Dependence of the outcome variable on the discrimination variable before and after the proposed transformation. F and M indicate Female and Male; A-A and C indicate African-American and Caucasian.

D (gender, race)   Before transformation            After transformation
                   p_{Y|D}(0|d)   p_{Y|D}(1|d)      p_{Ŷ|D}(0|d)   p_{Ŷ|D}(1|d)
F, A-A             0.607          0.393             0.607          0.393
F, C               0.633          0.367             0.633          0.367
M, A-A             0.407          0.593             0.596          0.404
M, C               0.570          0.430             0.596          0.404
Figure 5: Top row: high income percentages in the original dataset as a function of age and education, for the overall population (i.e. p_{Y|X}(1|x)) and for different groups (p_{Y|X,D}(1|x,d)). Bottom row: change in percentages due to the transformation, i.e. p_{Ŷ|X̂,D}(1|x,d) − p_{Y|X,D}(1|x,d), etc. Age-education pairs with fewer than 20 samples are not shown.

The mapping p_{X̂,Ŷ|X,Y,D} produced by the optimization (7) can reveal important insights on the nature of disparate impact and how to mitigate it. We illustrate this by exploring p_{X̂,Ŷ|X,Y,D} for the COMPAS dataset next. Fig. 3 displays the conditional mapping restricted to certain socio-demographic groups. First consider young males who are African-American (left-most plot). This group has a high recidivism rate, and hence the most prominent action of the mapping (besides the identity transformation) is to change the recidivism value from 1 (recidivism) to 0 (no recidivism). The next prominent action is to change the age category from young to middle aged (25 to 45 years). This effectively reduces the average value of Ŷ for young African-Americans, since the mapping for young males who are African-American and do not recidivate (middle plot) is essentially the identity mapping, with the exception of changing the age category to middle aged. This is expected, since increasing recidivism is heavily penalized. For young Caucasian males who recidivate, the action of the proposed transformation is similar to that for young African-American males who recidivate, i.e., the outcome variable is either changed to 0, or the age category is changed to middle aged. However, the probabilities of these transformations are lower since Caucasian males have, according to the dataset, a lower recidivism rate.

We apply this conditional mapping to the dataset (one trial) and present the results in Fig. 4. The original percentage recidivism rates are also shown in the top panel of the plot for comparison. Because of our constraint that disallows changing the outcome to 1, a demographic group's recidivism rate can (indirectly) increase only through changes to the decision variables (X). We note that the average percentage change in recidivism rates across all demographics is negative when the discrimination variables are marginalized out (leftmost column). The maximum decreases in recidivism rates are observed for African-American males since they have the highest value of p_{Y|D}(1|d) (cf. Table 2). Contrast this with Caucasian females (middle column), who have virtually no change in their recidivism rates since they are a priori close to the final ones (see Table 2). Another interesting observation is that middle aged Caucasian males with 1 to 3 prior counts see an increase in percentage recidivism. This is consistent with the mapping seen in Fig. 3 (middle), and is an example of the indirect introduction of positive outcomes in a cohort, as discussed above.

We apply our optimization approach to the well-known UCI Adult Dataset (Lichman, 2013) as a second illustration of its capabilities. The features were categorized as discriminatory variables (D): Race (White, Minority) and Gender (Male, Female); decision variables (X): Age (quantized to decades) and Education (quantized to years); and response variable (Y): Income (binary). While the response variable considered here is income, the dataset could be regarded as a simplified proxy for analyzing other financial outcomes such as credit approvals.

Specific Form of Optimization.
We use the ℓ₁-distance (twice the total variation) (Pollard, 2002) to measure utility,

    ∆(p_{X,Y}, p_{X̂,Ŷ}) = Σ_{x,y} | p_{X,Y}(x,y) − p_{X̂,Ŷ}(x,y) |.

For discrimination control, we use (1), with J given in (3), and we set ε_{y,d} = ε in (1). We use the distortion function in (5), and write x = (a, e) for an age-education pair and x̂ = (â, ê) for a corresponding transformed pair. The distortion function returns (i) v₁ if income is decreased, age is not changed, and education is increased by at most 1 year; (ii) v₂ if age is changed by a decade and education is increased by at most 1 year, regardless of the change in income; (iii) v₃ if age is changed by more than a decade, or education is lowered by any amount, or education is increased by more than 1 year; and (iv) 0 in all other cases. We assign increasing penalties v₁ < v₂ < v₃ (with v₁ = 1 and v₃ = 3), threshold δ to obtain constraints of the form (6), and choose the corresponding probability budgets c_{d,x,y} so that the budget for event (iii) is 0. As a consequence, decreases in income, small changes in age, and small increases in education (events (i), (ii)) are permitted with small probabilities, while larger changes in age and education (event (iii)) are not allowed at all. We note that the parameter settings are selected with the purpose of demonstrating our approach, and would change depending on the practitioner's requirements or guidelines.
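A sketch of the Adult distortion function described above follows, assuming age is integer-coded in decades and education in years; the event ordering follows cases (i)-(iv), and the dictionary-based encoding is an illustrative assumption.

```python
# Sketch of the Adult-data distortion function; events (i)-(iv) as described above.
def adult_distortion(orig, new, v=(1.0, 2.0, 3.0)):
    v1, v2, v3 = v                                # v2 shown here is a placeholder value
    d_age = abs(new["age"] - orig["age"])         # change in decades
    d_edu = new["edu"] - orig["edu"]              # signed change in years of education
    income_drop = new["income"] < orig["income"]
    if d_age > 1 or d_edu < 0 or d_edu > 1:       # (iii) large or undesirable changes
        return v3
    if d_age == 1:                                # (ii) age shifted by one decade
        return v2
    if income_drop and d_age == 0:                # (i) income decreased, age unchanged
        return v1
    return 0.0                                    # (iv) everything else
```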
Results.
For the remainder of the results presented here, we set ε = 0.15; the resulting optimal value of the utility measure (ℓ₁ distance) was small. Fig. 5 shows high income percentages before and after the transformation. Note that the percentages for the overall population (i.e. as a function of X only) are plotted throughout Fig. 5 for ease of comparison, and that changes in individual percentages may be larger than a factor of 1 ± ε because discrimination is not controlled by (1) at the level of age-education cohorts. The top left panel indicates that income is higher for more educated and middle-aged people, as expected. The second column shows that high income percentages are significantly lower for females and are accordingly increased by the transformation, most strongly for educated older women and younger women with only 8 years of education, and less so for other younger women. Conversely, the percentages are decreased for males but by much smaller magnitudes. Minorities receive small percentage increases, but smaller than those for women, in part because they are a more heterogeneous group consisting of both genders.

We proposed a flexible, data-driven optimization framework for probabilistically transforming data in order to reduce algorithmic discrimination, and applied it to two datasets. The differences between the original and transformed datasets revealed interesting discrimination patterns, as well as corrective adjustments for controlling discrimination while preserving utility of the data. Despite being programmatically generated, the optimized transformation satisfied properties that are sensible from a socio-demographic standpoint, reducing, for example, recidivism risk for males who are African-American in the recidivism dataset, and increasing income for well-educated females in the UCI adult dataset. The flexibility of the approach allows numerous extensions using different measures and constraints for utility preservation, discrimination, and individual distortion control. Investigating such extensions, developing theoretical characterizations based on the proposed framework, and quantifying the impact of the transformations on specific supervised learning tasks will be pursued in future work.
Appendix A Proof of Proposition 3
The proposition is a consequence of the following elementary lemma.
Lemma 1.
Let p(x), q(x) and r(x) be three fixed probability mass functions with the same discrete and finite support set X, and let

    c ≜ min_{x∈X} p(x)(1 − p(x)) / (3(1 + p(x))) > 0   and   p_m ≜ min_{x∈X} p(x) > 0.

If

    D_KL(p ‖ q) ≤ τ ≤ c,   (21)

and if for all x ∈ X

    γ₁ ≤ p(x)/r(x) ≤ γ₂,   (22)

then, with g(τ, p_m) ≜ √(6τ/p_m), for all x ∈ X

    γ₁ exp(−g(τ, p_m)) ≤ q(x)/r(x) ≤ γ₂ exp(g(τ, p_m)).   (23)

Proof.
We assume τ > 0; otherwise p(x) = q(x) for all x ∈ X and we are done. From (21) and the Data Processing Inequality for KL-divergence, for any x ∈ X,

    p(x) log( p(x)/q(x) ) + (1 − p(x)) log( (1 − p(x)) / (1 − q(x)) ) ≤ τ.   (24)

Let x be fixed and, in order to simplify notation, denote c ≜ p(x). Writing, without loss of generality, q(x) = c exp(−ατ/c), (24) implies

    f(α) ≜ α − ((1 − c)/τ) log( (1 − c exp(−ατ/c)) / (1 − c) ) ≤ 1.   (25)

The Taylor series of f(α) around 0 has the form

    f(α) = Σ_{n=2}^∞ ((−1)^n / n!) ( τ / ((1 − c)c) )^{n−1} A_{n−1}(c) α^n,   (26)

where A_n(c) is the Eulerian polynomial, which is positive for 0 < c < 1; in particular A₁(c) = 1 and A₂(c) = 1 + c.

First, assume α ≤ 0. Then f(α) can be lower-bounded by the first term in its Taylor series expansion, since all the terms in the series are non-negative. From (25),

    τα² / (2(1 − c)c) ≤ f(α) ≤ 1,   (27)

and consequently

    α ≥ −√( 2(1 − c)c / τ ).   (28)

Now assume α ≥ 0. Then the Taylor series (26) becomes an alternating series, and f(α) can be lower-bounded by its first two terms:

    τα² / (2(1 − c)c) − (1 + c)τ²α³ / (6(1 − c)²c²) ≤ f(α) ≤ 1.   (29)

The left-hand side of the first inequality in (29) satisfies

    τα² / (6(1 − c)c) ≤ τα² / (2(1 − c)c) − (1 + c)τ²α³ / (6(1 − c)²c²)   (30)

as long as α ≤ 2(1 − c)c / ((1 + c)τ). Since the left-hand side of (30) exceeds 1 when α > √(6(1 − c)c/τ), it is a valid lower bound for f(α) on the entire interval where f(α) ≤ 1 provided that

    √( 6(1 − c)c / τ ) ≤ 2(1 − c)c / ((1 + c)τ)  ⟺  τ ≤ 2(1 − c)c / (3(1 + c)²),   (31)

which holds by the assumption τ ≤ c(1 − c)/(3(1 + c)) in the Lemma, since 1 + c ≤ 2. Thus

    α ≤ √( 6(1 − c)c / τ ),   (32)

and combining with (28),

    −√( 6(1 − c)c / τ ) ≤ α ≤ √( 6(1 − c)c / τ ).   (33)

Finally, since q(x)/p(x) = exp(−ατ/p(x)), the previous inequalities give

    exp( −√( 6(1 − c)τ / c ) ) ≤ q(x)/p(x) ≤ exp( √( 6(1 − c)τ / c ) ),   (34)

and the result follows by writing q(x)/r(x) = (q(x)/p(x)) · (p(x)/r(x)), using γ₁ ≤ p(x)/r(x) ≤ γ₂ from (22), and bounding √(6(1 − c)τ/c) ≤ √(6τ/p_m) = g(τ, p_m) since 1 − c ≤ 1 and c = p(x) ≥ p_m.

The previous Lemma allows us to derive the result presented in Proposition 3.

Proof of Proposition 3.
Let m ≜ |X| · |Y| · |D|. The distribution p_{D,X,Y} is the type (Cover & Thomas, 2006, Chap. 11) of n observations of q_{D,X,Y}. Then, from (Csiszár & Shields, 2004, Corollary 2.1), for τ > 0,

    Pr( D_KL(p_{D,X,Y} ‖ q_{D,X,Y}) ≥ τ ) ≤ C(n + m − 1, m − 1) e^{−nτ} ≤ ( e(n + m)/m )^m e^{−nτ}.

(Other bounds on the KL-divergence between an observed type and its distribution could be used, such as (Cover & Thomas, 2006, Thm. 11.2.2), without changing the asymptotic result.) From the Data Processing Inequality for KL-divergence, D_KL(p_{D,Ŷ} ‖ q_{D,Ŷ}) ≤ D_KL(p_{D,X,Y} ‖ q_{D,X,Y}), and, consequently,

    Pr( D_KL(p_{D,Ŷ} ‖ q_{D,Ŷ}) ≤ τ ) ≥ Pr( D_KL(p_{D,X,Y} ‖ q_{D,X,Y}) ≤ τ ) ≥ 1 − ( e(n + m)/m )^m e^{−nτ}.

If D_KL(p_{D,Ŷ} ‖ q_{D,Ŷ}) ≤ τ, then, since 0 ≤ D_KL(p_D ‖ q_D), the chain rule gives

    D_KL( p_{Ŷ|D}(·|d) ‖ q_{Ŷ|D}(·|d) ) ≤ τ / p_D(d)   ∀ d ∈ D.

Choosing

    τ = (1/n) log( (1/β) ( e(n + m)/m )^m ),   (35)

then, with probability 1 − β, for all d ∈ D,

    D_KL( p_{Ŷ|D}(·|d) ‖ q_{Ŷ|D}(·|d) ) ≤ (1 / (n p_D(d))) log( (1/β) ( e(n + m)/m )^m ).

Let c_m ≜ min_{y∈Y, d∈D} p_{D,Ŷ}(d, y) > 0. Applying Lemma 1 for each d with p = p_{Ŷ|D}(·|d), q = q_{Ŷ|D}(·|d), and r = p_{Y_T}, where the assumption J(p_{Ŷ|D}(y|d), p_{Y_T}(y)) ≤ ε provides γ₁ = 1 − ε and γ₂ = 1 + ε, we obtain, for n large enough that τ ≤ min_{d,y} p_{Ŷ,D}(y,d)(1 − p_{Ŷ|D}(y|d)) / (3(1 + p_{Ŷ|D}(y|d))),

    (1 − ε) exp(−h(n, β)) ≤ q_{Ŷ|D}(y|d) / p_{Y_T}(y) ≤ (1 + ε) exp(h(n, β)),   (36)-(37)

where

    h(n, β) ≜ √( (6 / (n c_m)) log( (1/β) ( e(n + m)/m )^m ) ).   (38)

Observe that h(n, β) = Θ( √( (1/n) log(n/β) ) ). Since e^x ≈ 1 + x for x sufficiently small, we have

    | q_{Ŷ|D}(y|d) − p_{Y_T}(y) | / p_{Y_T}(y) ≤ ε + Θ( √( (1/n) log(n/β) ) ),   (39)

proving the first claim.

For the second claim, we start by applying the triangle inequality:

    ∆( q_{X,Y}, q_{X̂,Ŷ} ) ≤ ∆( p_{X,Y}, p_{X̂,Ŷ} ) + ∆( q_{X,Y}, p_{X,Y} ) + ∆( q_{X̂,Ŷ}, p_{X̂,Ŷ} )
                        ≤ µ + ∆( q_{X,Y}, p_{X,Y} ) + ∆( q_{X̂,Ŷ}, p_{X̂,Ŷ} ).   (40)

Now assume D_KL(p_{D,X,Y} ‖ q_{D,X,Y}) ≤ τ. Then the Data Processing Inequality for KL-divergence yields D_KL(p_{X,Y} ‖ q_{X,Y}) ≤ τ and D_KL(p_{X̂,Ŷ} ‖ q_{X̂,Ŷ}) ≤ τ. In addition, from Pinsker's inequality,

    ∆( q_{X,Y}, p_{X,Y} ) ≤ √( 2 D_KL(p_{X,Y} ‖ q_{X,Y}) ) ≤ 2√τ,

and, analogously, ∆( q_{X̂,Ŷ}, p_{X̂,Ŷ} ) ≤ 2√τ. Thus (40) becomes

    ∆( q_{X,Y}, q_{X̂,Ŷ} ) ≤ µ + 4√τ.   (41)

Selecting τ as in (35), then, with probability 1 − β,

    ∆( q_{X,Y}, q_{X̂,Ŷ} ) ≤ µ + 4 √( (1/n) log( (1/β) ( e(n + m)/m )^m ) ),   (42)

and the result follows.

References
Banerjee, Arindam, Merugu, Srujana, Dhillon, Inderjit S., and Ghosh, Joydeep. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705-1749, 2005.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

Calders, Toon and Žliobaitė, Indrė. Why unbiased computational processes can lead to discriminative decision procedures. In Discrimination and Privacy in the Information Society, pp. 43-57. Springer, 2013.

Chouldechova, Alexandra. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1610.07524, 2016.

Corbett-Davies, Sam, Pierson, Emma, Feller, Avi, Goel, Sharad, and Huq, Aziz. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. Wiley-Interscience, 2nd edition, July 2006.

Csiszár, Imre and Shields, Paul C. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417-528, 2004.

Diamond, Steven and Boyd, Stephen. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Durose, Matthew R., Cooper, Alexia D., and Snyder, Howard N. Recidivism of prisoners released in 30 states in 2005: Patterns from 2005 to 2010. Washington, DC: Bureau of Justice Statistics, 28, 2014.

Dwork, Cynthia, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Zemel, Richard. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214-226. ACM, 2012.

EEOC, The U.S. Uniform guidelines on employee selection procedures, March 1979.

Feldman, Michael, Friedler, Sorelle A., Moeller, John, Scheidegger, Carlos, and Venkatasubramanian, Suresh. Certifying and removing disparate impact. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 259-268, 2015.

Fish, Benjamin, Kun, Jeremy, and Lelkes, Ádám D. A confidence-based approach for balancing fairness and accuracy. In Proceedings of the SIAM International Conference on Data Mining, pp. 144-152. SIAM, 2016.

Friedler, Sorelle A., Scheidegger, Carlos, and Venkatasubramanian, Suresh. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236, 2016.

Hajian, Sara. Simultaneous Discrimination Prevention and Privacy Protection in Data Publishing and Mining. PhD thesis, Universitat Rovira i Virgili, 2013. Available online: https://arxiv.org/abs/1306.6805.

Hajian, Sara and Domingo-Ferrer, Josep. A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans. Knowl. Data Eng., 25(7):1445-1459, 2013.

Hardt, Moritz, Price, Eric, and Srebro, Nathan. Equality of opportunity in supervised learning. In Adv. Neur. Inf. Process. Syst. 29, pp. 3315-3323, 2016.

Kamiran, Faisal and Calders, Toon. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.

Kamishima, Toshihiro, Akaho, Shotaro, and Sakuma, Jun. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), IEEE 11th International Conference on, pp. 643-650. IEEE, 2011.

Kleinberg, Jon, Mullainathan, Sendhil, and Raghavan, Manish. Inherent trade-offs in the fair determination of risk scores. In Proc. Innov. Theoret. Comp. Sci., 2017.

Li, Ninghui, Li, Tiancheng, and Venkatasubramanian, Suresh. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering, pp. 106-115. IEEE, 2007.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Northpointe Inc. COMPAS - the most scientifically advanced risk and needs assessments.

Pedreschi, Dino, Ruggieri, Salvatore, and Turini, Franco. Discrimination-aware data mining. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 560-568. ACM, 2008.

Pedreschi, Dino, Ruggieri, Salvatore, and Turini, Franco. A study of top-k measures for discrimination discovery. In Proc. ACM Symp. Applied Comput., pp. 126-131, 2012.

Pollard, David. A User's Guide to Measure Theoretic Probability. Cambridge University Press, 2002.

ProPublica. Machine bias: Risk assessments in criminal sentencing, 2016.

ProPublica. COMPAS recidivism risk score data and analysis, 2017.

Ruggieri, Salvatore. Using t-closeness anonymity to control for non-discrimination. Trans. Data Privacy, 7(2):99-129, 2014.

Zafar, Muhammad Bilal, Valera, Isabel, Rodriguez, Manuel Gomez, and Gummadi, Krishna P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. arXiv preprint arXiv:1610.08452, 2016.

Zemel, Richard, Wu, Yu (Ledell), Swersky, Kevin, Pitassi, Toniann, and Dwork, Cynthia. Learning fair representations. In Proc. Int. Conf. Mach. Learn., pp. 325-333, 2013.

Zhang, Zhe and Neill, Daniel B. Identifying significant predictive bias in classifiers. In Proceedings of the NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016. Available online: https://arxiv.org/abs/1611.08292.

Zhao, Ming-Jie, Edakunni, Narayanan, Pocock, Adam, and Brown, Gavin. Beyond Fano's inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. J. Mach. Learn. Res., 14:1033-1090, 2013.

Žliobaitė, Indrė, Kamiran, Faisal, and Calders, Toon. Handling conditional discrimination. In IEEE 11th International Conference on Data Mining (ICDM), 2011.