Fairness in Supervised Learning: An Information Theoretic Approach
AmirEmad Ghassami∗, Sajad Khodadadian∗, Negar Kiyavash∗†
Departments of ECE∗ and ISE†, and Coordinated Science Laboratory∗, University of Illinois at Urbana-Champaign, Urbana, USA. {ghassam2, sajadk2, kiyavash}@illinois.edu

Abstract—Automated decision making systems are increasingly being used in real-world applications. In these systems, for the most part, the decision rules are derived by minimizing the training error on the available historical data. Therefore, if there is a bias related to a sensitive attribute such as gender, race, religion, etc. in the data, say, due to cultural/historical discriminatory practices against a certain demographic, the system could continue discrimination in decisions by including the said bias in its decision rule. We present an information theoretic framework for designing fair predictors from data, which aims to prevent discrimination against a specified sensitive attribute in a supervised learning setting. We use equalized odds as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. To ensure fairness and generalization simultaneously, we compress the data to an auxiliary variable, which is used for the prediction task. This auxiliary variable is chosen such that it is decontaminated from the discriminatory attribute in the sense of equalized odds. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable.
Index Terms—Fairness, Equalized odds, Supervised learning.
I. INTRODUCTION
Automated decision making systems based on statistical inference and learning are increasingly common in a wide range of real-world applications such as health care, law enforcement, education, and finance. These systems are trained on historical data, which might be biased towards certain attributes of the data points [1]–[3]. Hence, using such data without accounting for possible biases could result in discrimination, which is defined as gratuitous distinction between individuals with different sensitive attributes. These attributes include sex, race, and religion, and are referred to as protected attributes in the literature. As an example, in the US justice system, courts use features of criminals such as their age, race, sex, years spent in jail, etc., to estimate their possible recidivism (future arrest). After considering these features, the court assigns a score to each in-jail individual and decides whether to release that person: if the score exceeds a certain limit, it is deemed safe to release that individual. For instance, as noted in the analysis by Angwin et al. [4], risk scores in the criminal justice system (the COMPAS risk tool) are negatively biased towards African-Americans: they showed that this risk score unjustifiably assigns a higher risk of recidivism to African-American people than is warranted. As another example, the authors in [5] have studied the accuracy of gender representation in online image searches. The results indicate that, for instance, in a Google image search for "C.E.O.", 11 percent of the depicted results are women, even though 27 percent of U.S. chief executives are women; and in a search for "telemarketer", 64 percent of the people depicted were female, while the occupation is evenly split between men and women.

There is an interesting connection between the problem of fairness and differential privacy [6]–[8]. Just as in the differential privacy problem one tries to hide the identity of individuals, in the fairness problem the goal is to hide the information about the protected attribute. More details regarding this connection are presented in [1].

Different criteria for assessing discrimination have been suggested in the literature. The most commonly used criterion is the so-called demographic parity, which requires the predictor to be statistically independent of the protected attribute. That is, denoting the protected attribute and the prediction by $A$ and $\hat{Y}$, respectively, demographic parity requires the model to satisfy $P(A, \hat{Y}) = P(A)\,P(\hat{Y})$.

While demographic parity and its variants have been used in several works [9]–[12], in some scenarios this criterion fails to provide fairness to all demographics [1]. For example, in the case of hiring an employee, where the majority of the applicants are from a certain demographic, if we force the decision making system to be independent of that demographic, the system has to pick an equal number of applicants from each demographic. Therefore, the system may admit a lower qualified individual from the smaller demographic to guarantee that the percentages of hired people from different demographics match. Moreover, denoting the true label by $Y$, in most cases, as in the image search example, $Y$ is correlated with the protected attribute (see Figure 1). Therefore, as demographic parity forces $\hat{Y}$ to be independent of $A$, this criterion is not satisfied even by the ideal predictor $\hat{Y} = Y$.

Hardt, Price and Srebro have recently proposed equalized odds as a new criterion of fairness [2].
This notion demands that the predictor should be independent of the protected attribute conditioned on the actual label $Y$. Therefore, equalized odds requires the model to satisfy
$$P(A, \hat{Y} \mid Y) = P(A \mid Y)\, P(\hat{Y} \mid Y). \qquad (1)$$
Returning to the example of hiring an employee, this measure implies that among the qualified applicants, the probability of being hired should be the same across demographics. That is, if two people from different demographics are both qualified, or both not qualified, the system should hire them with equal probability. Also, note that unlike demographic parity, equalized odds allows for the ideal predictor $\hat{Y} = Y$. (A small numerical illustration of criterion (1) is given at the end of this section.)

In this paper, we present a new framework for designing fair predictors from data. We utilize an information theoretic approach to model the information content of variables in the system relative to one another, and we use equalized odds as the criterion to assess discrimination. In our proposed scheme, a data variable $X$ is first mapped to an auxiliary variable $U$ to decontaminate it from the discriminatory attribute as well as to ensure generalization. To design this auxiliary variable, for input variable $X$ and true label $Y$, we seek a compact representation $U$ of $X$ that contains at most a certain level of information about the variable $X$ (to avoid overfitting) but maximizes $I(Y;U)$ (quality of decision). The auxiliary variable $U$ is in turn used as the input for the prediction task. Similar to [2], our framework is based only on the joint statistics of the variables rather than functional forms; such a formulation is more general, and also more practical because, in many cases, the functional form of the score and the underlying training data are not public. Our formulation (unlike that of [2], for instance) allows both $A$ and $Y$ to have arbitrary cardinality, which means we can have multi-level protected attributes and labels. We cast the task of finding a fair predictor as an optimization problem and propose an iterative solution for solving it. We observe that the proposed solution does not necessarily converge for some levels of fairness, which suggests that for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable.

A somewhat similar idea to our approach is presented in [9], in which the authors used an intermediate representation space with elements called prototypes. However, besides the fact that demographic parity is used as the measure of discrimination in that work, the method for choosing the prototypes is quite different. Specifically, their main approach to avoid overfitting in the learning process is limiting the number of prototypes (unfortunately, nothing is said in that work about how to choose this number), while we achieve the same goal by controlling the information the auxiliary variable carries about the data. The approach in [9] has been extended in [13] with deep variational auto-encoders with priors that encourage independence between sensitive and latent factors of variation.

The rest of the paper is organized as follows. In Section II we review the notion of equalized odds and introduce our model as well as the details of our proposed learning procedure; additionally, we present the optimization problem that must be solved to address the fairness issue. In Section III we propose an iterative approach for solving this optimization problem. Our concluding remarks are presented in Section IV.
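To make criterion (1) concrete, the following sketch (ours, with hypothetical counts rather than data from any cited study) checks equalized odds for a binary predictor by comparing the empirical rates $P(\hat{Y} = \hat{y} \mid A = a, Y = y)$ across two groups:

```python
import numpy as np

# counts[a, y, yhat]: hypothetical tallies of (protected group, true label, prediction)
counts = np.array([[[50, 10],    # group a = 0, y = 0
                    [ 5, 35]],   # group a = 0, y = 1
                   [[48, 12],    # group a = 1, y = 0
                    [ 6, 34]]])  # group a = 1, y = 1

# conditional rates P(yhat | a, y); equalized odds (1) requires no dependence on a
rates = counts / counts.sum(axis=2, keepdims=True)
gap = np.abs(rates[0] - rates[1]).max()
print(f"largest rate gap across groups: {gap:.3f}")  # a gap near 0 means (1) holds
```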
Fig. 1: Graphical model of the proposed framework. $A$, $X$, and $Y$ denote the protected attribute, the rest of the attributes, and the true label, respectively. $U$ is the compressed representation of $X$, which is used for designing the prediction $\hat{Y}$.

II. MODEL DESCRIPTION
We consider a purely observational setting in which we train a predictor from labeled data. For each sample, we have a set of attributes, which includes protected attributes such as gender, race, religion, etc. The protected attributes are denoted by $A$. We use $X$ to denote the rest of the attributes. We denote the true label by $Y$ and the prediction of the label $Y$ by $\hat{Y}$. For instance, in the example regarding risk of recidivism explained in Section I, $A$ represents the race of each individual, $X$ represents other features of that individual (which could be correlated with the individual's race), and $Y$ determines whether he/she has committed any crimes after being released from jail.

The graphical model of our setup is depicted in Figure 1. As seen in this figure, $X$ and $A$ can be correlated, and given $X$, $A$ is independent of the true label $Y$. This property is essential; otherwise, the protected attribute is in fact a direct cause of the label, and using this attribute in the prediction process should not be considered discriminatory.

In order to find a fair predictor, if the joint distribution $P(A, X, Y)$ were known, we could find $P(\hat{Y}|X)$ close to $P(Y|X)$ in the sense of equalized odds. However, in reality only the empirical distribution $\hat{P}(A, X, Y)$, which is obtained from data, is available; therefore, it is required to make sure that the predictor generalizes.

Generalization: Since the number of available samples is finite, to prevent overfitting (ensuring generalization) we should constrain our hypothesis space. To do so, we compress our variable $X$ to an auxiliary variable $U$, which in turn is used for the prediction task. We also choose $U$ such that it is not contaminated by discrimination in the sense of equalized odds [2], defined in the following.

Definition 1. [Equalized odds] We say that a variable $U$ satisfies equalized odds with respect to protected attribute $A$ and outcome $Y$, if $U$ and $A$ are independent conditioned on $Y$, that is, $I(A;U|Y) = 0$.

This definition is equivalent to the one in expression (1). Once $U$ is decontaminated from the discriminatory attribute $A$, one can use any predictor to predict $Y$ from this auxiliary variable. We propose to apply a Bayesian empirical risk minimization decision rule in this work for the prediction task.

To obtain the mechanism for generating the auxiliary variable, we seek a compact representation $U$ of $X$ that maximizes the utility/quality of prediction $I(Y;U)$, while it contains at most a certain level of information about the variable $X$. This is in essence similar to the goal of the information bottleneck (IB) method [14]. Maximizing $I(Y;U)$ corresponds to maximizing the utility of $U$, and keeping $I(X;U)$ bounded can be viewed as regularization, which rejects complex hypotheses to ensure generalization. See [15] for a detailed discussion of using mutual information to bound the generalization error. Note that because we express fairness, accuracy, and compactness via mutual information, our setting places no requirement on the cardinality of the variables (as opposed to [2], [9]).

Next, we present the details of designing the transition probability kernel for generating the auxiliary variable, as well as the design of the final predictor.

A. Designing the Auxiliary Variable
As stated earlier, the goal of our learning scheme is to produce a compressed representation of $X$ that has as much information about the true label as possible and is fair in the sense of Definition 1. We relax the equalized odds requirement in that we allow $U$ to carry a certain amount of information about the variable $A$ conditioned on $Y$; the reason for this choice will become clear in Section III. Therefore, the objective is to find a mechanism $P(U|X)$ that maximizes $I(U;Y)$ and also:

1) Ensures fairness: The information shared between the protected attribute and $U$ given the true label does not exceed a certain threshold $C$, that is,
$$I(A;U|Y) \le C.$$
2) Ensures generalization: The mutual information between $X$ and $U$ does not exceed a certain threshold $D$, that is,
$$I(X;U) \le D.$$

Therefore, we aim to solve the following optimization problem:
$$\max_{P(U|X)} \; I(U;Y) \quad \text{s.t.} \quad I(A;U|Y) \le C, \quad I(X;U) \le D.$$
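To ground these quantities, here is a minimal sketch (ours; the helper names and the toy distributions are assumptions, not the authors' code) that evaluates the objective $I(U;Y)$ and the two constraint terms for a candidate kernel $P(U|X)$ given an empirical joint distribution:

```python
import numpy as np

EPS = 1e-12

def mi(J):
    """Mutual information between the two axes of a joint pmf J[i, j]."""
    pi, pj = J.sum(axis=1, keepdims=True), J.sum(axis=0, keepdims=True)
    return float(np.sum(J * np.log((J + EPS) / (pi * pj + EPS))))

def fairness_terms(P, Q):
    """P[a, x, y]: empirical joint of (A, X, Y); Q[u, x]: candidate kernel P(u|x).
    Returns (I(U;Y), I(A;U|Y), I(X;U)) under the Fig. 1 factorization."""
    Pxy = P.sum(axis=0)                      # P(x, y)
    Px = Pxy.sum(axis=1)                     # P(x)
    I_uy = mi(Q @ Pxy)                       # joint P(u, y) = sum_x Q(u|x) P(x, y)
    I_xu = mi(Q * Px[None, :])               # joint P(u, x) = Q(u|x) P(x)
    Jauy = np.einsum('ux,axy->auy', Q, P)    # joint P(a, u, y)
    Py = Jauy.sum(axis=(0, 1))
    I_au_given_y = sum(Py[y] * mi(Jauy[:, :, y] / (Py[y] + EPS))
                       for y in range(len(Py)))
    return I_uy, I_au_given_y, I_xu

# a candidate Q is feasible when I(A;U|Y) <= C and I(X;U) <= D
rng = np.random.default_rng(0)
P = rng.random((2, 4, 2)); P /= P.sum()
Q = rng.random((3, 4)); Q /= Q.sum(axis=0, keepdims=True)
print(fairness_terms(P, Q))
```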
B. Designing the Predictor

As stated before, after obtaining the decontaminated variable $U$, this variable can be used for the prediction task. We utilize a Bayesian decision rule, described in the following.

Let $\mathcal{U}$ be the alphabet of the variable $U$ and $\mathcal{Y}$ be the alphabet of the variables $Y$ and $\hat{Y}$. To quantify the quality of a decision, define a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$, where $\ell(\hat{y}, y)$ determines the cost of predicting $\hat{y}$ when the true label is $y$. The decisions are based on the auxiliary variable $U$, which is statistically related to the true label. We denote the decision rule by $\delta: \mathcal{U} \to \mathcal{Y}$. The loss of the decision rule $\delta$ is defined as
$$L(\delta) = \mathbb{E}_{U,Y}\big[\ell(\delta(U), Y)\big].$$
Using $L(\delta)$, the Bayesian risk minimization decision rule is
$$\delta^* = \arg\min_{\delta} L(\delta).$$
For instance, for the case of binary labels with Hamming loss, defined as $\ell(\hat{y}, y) = \mathbb{1}[y \ne \hat{y}]$, we have
$$\delta^*(u) = \mathbb{1}\big[P(Y=1 \mid u) \ge P(Y=0 \mid u)\big],$$
which means we vote for the label with the maximum posterior probability.
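In matrix form, the Bayes rule reduces to an argmin over expected posterior loss. A small sketch (ours, with made-up posteriors) for the binary Hamming-loss case:

```python
import numpy as np

def bayes_rule(S, loss):
    """S[y, u]: posterior P(y|u); loss[yhat, y]: cost of predicting yhat when truth is y.
    Returns the risk-minimizing prediction delta*(u) for each u."""
    risk = loss @ S               # risk[yhat, u] = sum_y loss[yhat, y] * S[y, u]
    return risk.argmin(axis=0)

S = np.array([[0.7, 0.2],         # P(Y = 0 | u)
              [0.3, 0.8]])        # P(Y = 1 | u)
hamming = 1.0 - np.eye(2)         # l(yhat, y) = 1[yhat != y]
print(bayes_rule(S, hamming))     # [0 1]: the maximum-posterior label for each u
```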
III. SOLVING THE FAIRNESS OPTIMIZATION PROBLEM

In this section, we propose a solution for the fairness optimization problem presented in Section II. The Lagrangian for this problem is
$$\mathcal{L}(P(U|X)) = \alpha I(X;U) + \beta I(A;U|Y) - I(U;Y), \qquad (2)$$
where the parameters $\alpha$ and $\beta$ determine the trade-off between accuracy, information compression, and fairness.

Equation (2) is similar to the objective function in [16], where, for given variables $X$, $Y^+$, and $Y^-$, the authors aimed to uncover structures in $P(X, Y^+)$ that do not exist in $P(X, Y^-)$, for use in hierarchical text categorization.

We propose an alternating optimization method to solve the aforementioned problem. The pseudo-code of the proposed approach is presented in Algorithm 1. In each iteration, $\mathcal{L}$ is reduced by minimizing the objective function over the three distributions $Q(U|X)$, $R(U)$, and $S(Y|U)$ separately. The functions $f(X, U, \alpha, \beta)$ and $Z(X, \alpha, \beta)$, used for updating $Q(U|X)$, are defined as follows:
$$Z(x, \alpha, \beta) = \sum_u R(u) \exp\big(f(x, u, \alpha, \beta)\big), \qquad (3)$$
and
$$f(x, u, \alpha, \beta) = \frac{\beta}{\alpha} \sum_{y'} P(y'|x)\, D\!\left( P(A|x) \,\Big\|\, \frac{\sum_{x''} Q(u|x'') P(x'', y', A)}{\sum_{x''} Q(u|x'') P(x'', y')} \right) - \frac{1}{\alpha}\, D\big( P(Y|x) \,\big\|\, S(Y|u) \big). \qquad (4)$$
Theorem 1. For values of $\beta$ small enough, and any arbitrary value of $\alpha$, Algorithm 1 converges to a stationary point of the Lagrangian function $\mathcal{L}$ given in equation (2).

See Appendix A for a proof.

In general, there is no guarantee that Algorithm 1 converges to the global minimum of the Lagrangian. Nevertheless, experimental results show that this alternating optimization algorithm almost always converges to a local minimum of the objective function in (2). Note that since achieving the global optimum is not guaranteed, one should initiate the algorithm from several different starting distributions.

(Throughout the paper, uppercase letters in the argument of a distribution indicate all the parameters of the distribution, e.g., $P(U|X) \equiv \{P(u|x), \forall u, x\}$.)

Algorithm 1 Designing the conditional distribution of $U$.

Input:
Empirical distribution $\hat{P}(A, X, Y)$; initial distributions $Q(U|X)$, $R(U)$, and $S(Y|U)$; parameters $\alpha$, $\beta$; termination threshold $\epsilon > 0$.
Initialize $\mathcal{L}_0 = 0$, $\mathcal{L}_1 = \epsilon$, and $t = 1$.
while $|\mathcal{L}_t - \mathcal{L}_{t-1}| \ge \epsilon$ do
  $Q_t(u|x) \leftarrow \dfrac{R_{t-1}(u)}{Z_{t-1}(x, \alpha, \beta)} \exp\big(f_{t-1}(x, u, \alpha, \beta)\big), \;\; \forall u, x$
  $R_t(u) \leftarrow \sum_{x'} Q_t(u|x')\, P(x'), \;\; \forall u$
  $S_t(y|u) \leftarrow \dfrac{1}{R_t(u)} \sum_{x'} Q_t(u|x')\, P(y, x'), \;\; \forall u, y$
  $\mathcal{L}_{t+1} \leftarrow \alpha I(X;U) + \beta I(A;U|Y) - I(U;Y)$
  $t \leftarrow t + 1$
end while
Output: Conditional distribution $Q(U|X)$.

The fact that convergence occurs only for a certain range of values of the parameter $\beta$ suggests that for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable. This implies an inherent bound on the level of fairness that any algorithm can achieve, a conclusion which could not have been obtained from the other existing works.
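For concreteness, here is a minimal NumPy sketch of Algorithm 1's alternating updates (ours: the toy joint distribution, variable names, and the simple convergence check on successive kernels are our assumptions, not the authors' implementation):

```python
import numpy as np

EPS = 1e-12
rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence D(p || q) between discrete pmfs of the same shape."""
    return float(np.sum((p + EPS) * np.log((p + EPS) / (q + EPS))))

# toy empirical joint P(a, x, y) and trade-off parameters (beta kept small)
P = rng.random((2, 4, 2)); P /= P.sum()
nA, nX, nY = P.shape
nU, alpha, beta = 3, 1.0, 0.1

Pxy = P.sum(axis=0)                      # P(x, y)
Px = Pxy.sum(axis=1)                     # P(x)
Py_x = Pxy / Px[:, None]                 # P(y | x)
Pa_x = P.sum(axis=2).T / Px[:, None]     # P(a | x), rows indexed by x

Q = rng.random((nU, nX)); Q /= Q.sum(axis=0, keepdims=True)   # initial Q(u|x)

for t in range(500):
    R = Q @ Px                                # R(u) <- sum_x Q(u|x) P(x)
    S = (Q @ Pxy) / (R[:, None] + EPS)        # S[u, y] = S(y|u)
    num = np.einsum('ux,axy->uya', Q, P)      # sum_x Q(u|x) P(a, x, y)
    Pa_yu = num / (num.sum(axis=2, keepdims=True) + EPS)   # P(a | y, u)

    # f(x, u) from equation (4), then the exponential update of equation (3)
    f = np.empty((nU, nX))
    for u in range(nU):
        for x in range(nX):
            fair = sum(Py_x[x, y] * kl(Pa_x[x], Pa_yu[u, y]) for y in range(nY))
            f[u, x] = (beta / alpha) * fair - kl(Py_x[x], S[u]) / alpha
    Qnew = R[:, None] * np.exp(f)
    Qnew /= Qnew.sum(axis=0, keepdims=True)   # division by Z(x, alpha, beta)

    if np.max(np.abs(Qnew - Q)) < 1e-9:       # stop when the kernel stabilizes
        break
    Q = Qnew

print(f"stopped after {t} iterations")
```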
IV. CONCLUSION

We studied the problem of fairness in supervised learning, motivated by the fact that automated decision making systems may inherit biases related to sensitive attributes, such as gender, race, religion, etc., from the historical data on which they were trained. We presented a new framework for designing fair predictors from data via information theoretic machinery. Equalized odds was used as the criterion for discrimination, which demands that the prediction should be independent of the protected attribute conditioned on the actual label. In our proposed scheme, a data variable is first mapped to an auxiliary variable to decontaminate it from the discriminatory attribute as well as to ensure generalization. We modeled the task of designing the auxiliary variable as an optimization problem which forces the variable to be fair in the sense of equalized odds and maximizes the mutual information between the auxiliary variable and the true label, while keeping the information that this variable contains about the data limited. We proposed an alternating optimization method for solving this problem. We observed that the proposed solution does not necessarily converge for some levels of fairness, which suggests that for a given requirement on the accuracy of a predictor, certain levels of fairness may not be achievable. The final predictor is obtained by applying a Bayesian decision rule to the auxiliary variable. Finding an exact bound on the achievable level of fairness, as well as applying the proposed method to real data, is left for future work.
APPENDIX A
PROOF OF THEOREM 1

Expanding the Lagrangian in (2), we have
$$\mathcal{L}(P(U|X)) = \alpha \sum_{x,u} P(x)P(u|x)\log\frac{P(u|x)}{P(u)} + \beta G(P(U|X)) + \sum_{x,u,y} P(x,y)P(u|x)\log\frac{P(y|x)}{P(y|u)} - I(X;Y), \qquad (5)$$
where
$$G(P(U|X)) = I(A;U|Y) = \sum_{a,u,y,x} P(u|x)P(a,y,x)\log\frac{\sum_{x'} P(u|x')P(x',y,a)}{\sum_{x'} P(u|x')P(x',y)}.$$

We note that the only unknown parameters are $P(U|X)$; all of the other distributions can be estimated from the given samples of $(X, Y, A)$. Changing the notation of $P(u|x)$ to $Q(u|x)$ (to emphasize that it is the quantity being designed), and using [17, Lemma 10.8.1], we can write the optimization as follows:
$$\begin{aligned}
\min_{Q(U|X)} \mathcal{L}(Q(U|X)) &= \min_{Q(U|X)} \Big[ \alpha \sum_{x,u} P(x)Q(u|x)\log\frac{Q(u|x)}{P(u)} + \beta G(Q(U|X)) + \sum_{x,u,y} P(x,y)Q(u|x)\log\frac{P(y|x)}{P(y|u)} \Big] - I(X;Y) \\
&= \min_{Q(U|X)} \Big[ \min_{S(Y|U)} \min_{R(U)} \Big( \alpha \sum_{x,u} P(x)Q(u|x)\log\frac{Q(u|x)}{R(u)} + \beta G(Q(U|X)) + \sum_{x,u,y} P(x,y)Q(u|x)\log\frac{P(y|x)}{S(y|u)} \Big) \Big] - I(X;Y),
\end{aligned}$$
where the inner minimizations are over all probability distributions. Changing the order of the three minimizations, we obtain
$$\min_{S(Y|U)} \min_{R(U)} \min_{Q(U|X)} \; \alpha \sum_{x,u} P(x)Q(u|x)\log\frac{Q(u|x)}{R(u)} + \beta G(Q(U|X)) + \sum_{x,u,y} P(x,y)Q(u|x)\log\frac{P(y|x)}{S(y|u)} - I(X;Y). \qquad (6)$$

Since $x \mapsto x\log x$ is a convex function, and the sum of a convex function and a linear function remains convex, the first and third terms of equation (6) combined form a convex function of $Q(u|x)$, $\forall u, x$. For any function $G(Q(U|X))$, there exists $\beta$ small enough such that the combination of the first three terms of equation (6) remains convex with respect to each $Q(u|x)$, $\forall u, x$.

We add one more term $\lambda(x)\big(\sum_u Q(u|x) - 1\big)$, $\forall x$, to the Lagrangian for the constraint that for each $x$, $Q(u|x)$ should sum to 1. As a result, taking the derivative of this function with respect to $Q(u|x)$, $\forall u, x$, and setting it equal to zero, the minimum of the function can be found. Below, the derivative of each term is taken separately. The first term is
$$\mathcal{L}_1 = \sum_{x',u'} P(x')Q(u'|x')\log\frac{Q(u'|x')}{R(u')}.$$
Therefore,
$$\frac{\partial \mathcal{L}_1}{\partial Q(u|x)} = P(x)\log\frac{Q(u|x)}{R(u)} + \sum_{x',u'} P(x')\,\delta_{uu'}\delta_{xx'} = P(x)\log\frac{Q(u|x)}{R(u)} + P(x).$$

For the second term in $\mathcal{L}$ we have
$$\mathcal{L}_2 = I(A;U|Y) = \sum_{a',u',y',x'} P(a',u',y',x')\log\frac{P(u'|a',y')}{P(u'|y')}.$$
Due to the graphical model in Figure 1, we have
$$P(a',u',y',x') = P(a')P(x'|a')Q(u'|x')P(y'|x').$$
Therefore,
$$\frac{\partial P(a',u',y',x')}{\partial Q(u|x)} = P(a')P(x'|a')\,\delta_{uu'}\delta_{xx'}\,P(y'|x').$$
The derivatives of $P(u|a,y)$ and $P(u|y)$ can be obtained similarly.
Therefore, we have
$$\frac{\partial \mathcal{L}_2}{\partial Q(u|x)} = \sum_{a',y'} P(y',x)P(a'|x)\log\frac{P(a'|y',u)}{P(a'|y')} = -\sum_{y'} P(y',x)\, D\!\left( P(A|x) \,\Big\|\, \frac{\sum_{x''} Q(u|x'')P(x'',y',A)}{\sum_{x''} Q(u|x'')P(x'',y')} \right) + \sum_{y'} P(y',x)\, D\big(P(A|x) \,\|\, P(A|y')\big).$$

For the third term in $\mathcal{L}$ we have
$$\mathcal{L}_3 = \sum_{u',y'} P(u',y')\log\frac{S(y'|u')}{P(y')}.$$
Therefore,
$$\frac{\partial \mathcal{L}_3}{\partial Q(u|x)} = -P(x)\, D\big(P(Y|x) \,\|\, S(Y|u)\big) + P(x)\, D\big(P(Y|x) \,\|\, P(Y)\big).$$

Summing up all terms of the derivative and setting the result equal to zero, we get the desired result in (3) and (4). Using the calculated $Q(u|x)$, $\forall u, x$, we can minimize over $R(U)$ and $S(Y|U)$. Again using [17, Lemma 10.8.1], the minimum is achieved at the marginal distributions $P(Y|U)$ and $P(U)$, which can be found from $Q(U|X)$ according to Algorithm 1.

Regarding convergence, we note that the Lagrangian in equation (2) can be written as follows:
$$\mathcal{L} = \alpha\, \mathbb{E}_X\big[D(P(U|x) \,\|\, P(U))\big] + \beta\, \mathbb{E}_{A,Y}\big[D(P(U|a,y) \,\|\, P(U|y))\big] + \mathbb{E}_{X,U}\big[D(P(Y|x) \,\|\, P(Y|u))\big] - I(X;Y).$$
Since the first three terms of $\mathcal{L}$ are linear combinations of KL-divergences, and hence non-negative, $\mathcal{L}$ is lower bounded by $-I(X;Y)$, which is a constant. In addition, in Algorithm 1, assuming small enough $\beta$, in each of the three steps of the alternating algorithm the value of $\mathcal{L}$ decreases. Therefore, there exists $\beta_{\max}$ such that for values of $\beta \le \beta_{\max}$, the algorithm converges to a stationary point of the objective function in (2).
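As a sanity check on this decomposition, the following sketch (ours, with toy distributions) numerically verifies that the KL-divergence form equals $\alpha I(X;U) + \beta I(A;U|Y) - I(U;Y)$ and respects the lower bound $-I(X;Y)$:

```python
import numpy as np

EPS = 1e-12
rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum((p + EPS) * np.log((p + EPS) / (q + EPS))))

def mi(J):
    pi, pj = J.sum(axis=1, keepdims=True), J.sum(axis=0, keepdims=True)
    return float(np.sum(J * np.log((J + EPS) / (pi * pj + EPS))))

P = rng.random((2, 4, 2)); P /= P.sum()                     # P(a, x, y)
Q = rng.random((3, 4)); Q /= Q.sum(axis=0, keepdims=True)   # Q(u|x)
alpha, beta = 1.0, 0.1

Pxy = P.sum(axis=0); Px = Pxy.sum(axis=1); Py = Pxy.sum(axis=0)
Juy = Q @ Pxy                              # P(u, y)
R = Q @ Px                                 # P(u)
S = Juy / (R[:, None] + EPS)               # P(y | u)
Jauy = np.einsum('ux,axy->auy', Q, P)      # P(a, u, y)

# mutual-information form of the Lagrangian (2)
I_xu = mi(Q * Px[None, :])
I_uy = mi(Juy)
I_au_y = sum(Py[y] * mi(Jauy[:, :, y] / (Py[y] + EPS)) for y in range(len(Py)))
L_mi = alpha * I_xu + beta * I_au_y - I_uy

# KL-divergence form from the proof
Pu_y = Jauy.sum(axis=0) / (Py[None, :] + EPS)            # P(u | y)
Pu_ay = Jauy / (Jauy.sum(axis=1, keepdims=True) + EPS)   # P(u | a, y)
Pay = Jauy.sum(axis=1)                                   # P(a, y)
Pyx = Pxy / Px[:, None]                                  # P(y | x)
term1 = alpha * sum(Px[x] * kl(Q[:, x], R) for x in range(len(Px)))
term2 = beta * sum(Pay[a, y] * kl(Pu_ay[a, :, y], Pu_y[:, y])
                   for a in range(P.shape[0]) for y in range(P.shape[2]))
term3 = sum(Q[u, x] * Px[x] * kl(Pyx[x], S[u])
            for u in range(Q.shape[0]) for x in range(len(Px)))
L_kl = term1 + term2 + term3 - mi(Pxy)

print(np.isclose(L_mi, L_kl), L_mi >= -mi(Pxy))   # True True
```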
ACKNOWLEDGMENT

This work was supported in part by MURI grant ARMY W911NF-15-1-0479, Navy grant N00014-16-1-2804, and NSF grant CNS 17-18952.
REFERENCES
[1] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, "Fairness through awareness," in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM, 2012, pp. 214–226.
[2] M. Hardt, E. Price, N. Srebro et al., "Equality of opportunity in supervised learning," in Advances in Neural Information Processing Systems, 2016, pp. 3315–3323.
[3] L. E. Celis, D. Straszak, and N. K. Vishnoi, "Ranking with fairness constraints," arXiv preprint arXiv:1704.06840, 2017.
[4] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, "Machine bias," ProPublica, 2016.
[5] M. Kay, C. Matuszek, and S. A. Munson, "Unequal representation and gender stereotypes in image search results for occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015, pp. 3819–3828.
[6] C. Dwork, "Differential privacy: A survey of results," in International Conference on Theory and Applications of Models of Computation. Springer, 2008, pp. 1–19.
[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in TCC, vol. 3876. Springer, 2006, pp. 265–284.
[8] K. Kalantari, L. Sankar, and A. D. Sarwate, "Optimal differential privacy mechanisms under Hamming distortion for structured source classes," in Information Theory (ISIT), 2016 IEEE International Symposium on. IEEE, 2016, pp. 2069–2073.
[9] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, "Learning fair representations," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 325–333.
[10] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, "Certifying and removing disparate impact," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 259–268.
[11] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, "Fairness constraints: Mechanisms for fair classification," arXiv preprint arXiv:1507.05259, 2017.
[12] H. Edwards and A. Storkey, "Censoring representations with an adversary," arXiv preprint arXiv:1511.05897, 2015.
[13] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, "The variational fair autoencoder," arXiv preprint arXiv:1511.00830, 2015.
[14] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in The 37th Allerton Conference on Communication, Control, and Computing, 1999.
[15] A. Xu and M. Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Advances in Neural Information Processing Systems, 2017, pp. 2521–2530.
[16] G. Chechik and N. Tishby, "Extracting relevant structures with side information," in Advances in Neural Information Processing Systems, 2003, pp. 881–888.
[17] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2006.