A Variational Approach to Privacy and Fairness
Borja Rodríguez-Gálvez Ragnar Thobaben Mikael Skoglund
Division of Information Science and Engineering (ISE), KTH Royal Institute of Technology
{borjarg,ragnart,skoglund}@kth.se
Abstract
In this article, we propose a new variational approach to learn private and/or fair representations. This approach is based on the Lagrangians of a new formulation of the privacy and fairness optimization problems that we propose. In this formulation, we aim at generating representations of the data that keep a prescribed level of the relevant information that is not shared by the private or sensitive data, while minimizing the remaining information they keep. The proposed approach (i) exhibits the similarities of the privacy and fairness problems, (ii) allows us to control the trade-off between utility and privacy or fairness through the Lagrange multiplier parameter, and (iii) can be comfortably incorporated into common representation learning algorithms such as the VAE, the β-VAE, the VIB, or the nonlinear IB.

Currently, many systems rely on machine learning algorithms to make decisions and draw inferences. That is, they use previously existing data in order to shape some stage of their decision or inference mechanism. Usually, this data contains private or sensitive information, e.g., the identity of the person from which a datum was collected or their membership to a minority group. Therefore, an important problem occurs when the data used to train such algorithms leaks this information to the system, thus contributing to unfair decisions or to a privacy breach.

When the content of the private or sensitive information is arbitrary and the task of the system is not defined, the problem is reduced to learning private representations of the data; i.e., representations that are informative of the data (utility), but are not informative of the private or sensitive information. Then, these private representations can be employed by any system with a controlled leakage of private information. If the level of informativeness is measured with the mutual information, the problem of generating private representations is known as the privacy funnel (PF) [1, 2].

When the task of the system is known, the aim is to design strategies so that the system performs such a task efficiently while obtaining or leaking little information about the sensitive data. The field of algorithmic fairness has extensively studied this problem, especially for classification tasks and categorical sensitive data; see, e.g., [3, 4, 5, 6, 7]. An interesting approach is that of learning fair representations [8, 9, 10], where, similarly to their private counterparts, the representations are informative of the task output, but are not informative of the sensitive information.

There is a compromise between the information leakage and the utility when designing private representations [2]. Similarly, in the field of algorithmic fairness, it has been shown empirically [11] and theoretically [12] that there is a trade-off between fairness and utility.

In this work, we investigate the trade-off between utility and privacy and between utility and fairness in terms of mutual information. More specifically, we aim at maintaining a certain level of the information about the data (for privacy) or the task output (for fairness) that is not shared by the sensitive data, while minimizing all the other information. We name these two optimization problems the conditional privacy funnel (CPF) and the conditional fairness bottleneck (CFB) due to their
similarities with the privacy funnel [2], the information bottleneck [13], and the recent conditional entropy bottleneck [14].

We tackle both optimization problems with a variational approach based on their Lagrangians. For the privacy problem, we show that the minimization of the Lagrangians of the CPF and the PF is equivalent (see the supplementary material A.2), meaning that our variational approach attempts to solve the PF as well. Moreover, this approach improves over current variational approaches to the PF by respecting the problem's Markov chain in the encoder distribution.

Finally, the resulting approaches for privacy and fairness can be implemented with little modification to common algorithms for representation learning like the variational autoencoder (VAE) [15], the β-VAE [16], the variational information bottleneck (VIB) [24], or the nonlinear information bottleneck [18]. Therefore, it facilitates the incorporation of private and fair representations into current applications (see the supplementary material B for a guide on how to modify these algorithms).

We demonstrate our results both on the Adult dataset (available at [19]), which is commonly employed for benchmarking both tasks, and on a high-dimensional toy dataset based on the MNIST hand-written digits dataset [20]. Further experiments can be found in the supplementary material D.

In this section we give an overview of our approach. First, we introduce our proposed model for the privacy and fairness problems. Then, we present a suitable Lagrangian formulation. Finally, we describe a variational approach to solving both problems.
Let us consider two random variables X ∈ 𝒳 and S ∈ 𝒮 such that I(X;S) > 0. The random variable X represents some data which is of interest to us, and the random variable S represents some private data. We wish to disclose the data of interest X; however, we do not want the receiver of this data to draw inferences about the private data S. For this reason, we encode the data of interest X into the representation Y ∈ 𝒴, forming the Markov chain S ↔ X → Y.

This encoding, characterized by the conditional probability distribution P_{Y|X}, is designed so that the representation Y keeps a certain level r of the information about the data of interest X that is not shared by the private data S (i.e., the light gray area in Figure 1a), while minimizing the information it keeps about the private data S (i.e., the dark gray area in Figure 1a). That is,

$$\arg\inf_{P_{Y|X}} \left\{ I(S;Y) \right\} \quad \text{s.t.} \quad I(X;Y|S) \geq r. \tag{1}$$

The main difference with the privacy funnel formulation [2] is that, even though both formulations minimize the information the representation Y keeps about the private data S, in the PF the encoding is designed so that Y keeps a certain level r′ of the information about X, regardless of whether this information is also shared by S. Hence, since I(X;Y|S) ≤ I(X;Y), the restrictions of the CPF on the representations are stronger, unless X ⊥ S, in which case I(S;Y) = 0 and they are equal. Nevertheless, the minimization of the Lagrangians of both problems is equivalent (see the supplementary material A.2).

Let us now consider three random variables X ∈ 𝒳, S ∈ 𝒮, and T ∈ 𝒯 such that I(X;S) > 0 and I(S;T) > 0. The random variable X represents some data we want to use to draw inferences about the task T. However, we do not want our inferences to be influenced by the sensitive data S. For this reason, we first encode the data X into a representation Y ∈ 𝒴, which is then used to draw inferences about T. Therefore, the Markov chains S ↔ X → Y and T ↔ X → Y hold.

This encoding, characterized by the conditional probability distribution P_{Y|X}, is designed so that the representation Y keeps a certain level r of the information about the task output T that is not shared by the sensitive data S (i.e., the light gray area in Figure 1b), while minimizing both the information it keeps about the sensitive data S and the information about X that is not shared with the task output T:

$$\arg\inf_{P_{Y|X}} \left\{ I(S;Y) + I(X;Y|S,T) \right\} \quad \text{s.t.} \quad I(T;Y|S) \geq r. \tag{2}$$

This formulation differs from other approaches to fairness mainly in two points: (i) Similarly to the information bottleneck [13], the CFB does not only minimize the information the representations Y keep about the sensitive data S, but also minimizes the information about X that is not relevant to draw inferences about T. That is, the CFB seeks representations that are both fair and relevant, thus avoiding the risk of keeping nuisances [21] and harming their generalization capability. (ii) Similarly to the conditional entropy bottleneck [14], the CFB aims to produce representations Y that keep a certain level r of the information about the task T that is not shared by S. This differs from formulations that aim at producing representations that maintain a certain level r′ of the information about T, regardless of whether it is also shared by the sensitive data S.
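The inequality I(X;Y|S) ≤ I(X;Y) invoked above, and the decomposition behind Propositions 1 and 2 below, both follow from the chain rule of mutual information. Since this step recurs throughout the paper, we make it explicit here; the derivation is standard and uses only the stated Markov chain, although it is not spelled out in this form in the original text.

```latex
% Expand I(X,S;Y) with the chain rule in both orders. The Markov chain
% S <-> X -> Y gives I(S;Y|X) = 0, since Y depends on (X,S) only through X.
\begin{align*}
  I(X,S;Y) &= I(X;Y) + I(S;Y|X) = I(X;Y), \\
  I(X,S;Y) &= I(S;Y) + I(X;Y|S), \\
  \text{hence}\quad I(X;Y) &= I(S;Y) + I(X;Y|S) \;\geq\; I(X;Y|S),
\end{align*}
% with equality if and only if I(S;Y) = 0, e.g., when X and S are independent.
```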
Figure 1: Information diagrams [22] of (a) the conditional privacy funnel and (b) the conditional fairness bottleneck. In light gray, the relevant information we would like Y to keep. In dark gray, the useless information we would like Y to discard.

A common approach to solving optimization problems such as the CPF or the CFB is to minimize the
Lagrangian of the problem. The Lagrangian is a proxy of the trade-off between the function to optimize and the constraints on the optimization search space [23, Chapter 5]. Particularly, the Lagrangians of the CPF and the CFB are, respectively,

$$\mathcal{L}_{\text{CPF}}(P_{Y|X}, \lambda) = I(S;Y) - \lambda I(X;Y|S) \quad \text{and} \tag{3}$$

$$\mathcal{L}_{\text{CFB}}(P_{Y|X}, \lambda) = I(S;Y) + I(X;Y|S,T) - \lambda I(T;Y|S), \tag{4}$$

where λ > 0 is the Lagrange multiplier. This multiplier controls the trade-off between the information the representations Y discard and the information they keep. (Note that if λ ≤ 0, the optimization only seeks maximally compressed representations Y; hence, trivial encoding distributions like a degenerate distribution P_{Y|X} with density p_{Y|X} = δ(Y) are minimizers of the Lagrangian.)

If we look at the information diagrams [22] of the CPF and the CFB from Figure 1, we observe how L_CPF(P_{Y|X}, λ) and L_CFB(P_{Y|X}, λ), by means of the multiplier λ, exactly control the trade-off between the information we want the representations to keep (i.e., the light gray area) and all the other information (i.e., the dark gray area).

In the following propositions, proved in the supplementary material A.1, we present two alternative Lagrangians that can be minimized instead of the original problem Lagrangians in order to obtain the same result. These Lagrangians are more tractable and exhibit similar properties and structure in the privacy and fairness problems.

Proposition 1.
Minimizing L_CPF(P_{Y|X}, λ) is equivalent to minimizing J_CPF(P_{Y|X}, γ), where γ = λ + 1 and

$$J_{\text{CPF}}(P_{Y|X}, \gamma) = I(X;Y) - \gamma I(X;Y|S). \tag{5}$$

Proposition 2.
Minimizing L_CFB(P_{Y|X}, λ) is equivalent to minimizing J_CFB(P_{Y|X}, β), where β = λ + 1 and

$$J_{\text{CFB}}(P_{Y|X}, \beta) = I(X;Y) - \beta I(T;Y|S). \tag{6}$$

Figure 2: Graphical models of (a) the variational conditional privacy funnel and (b) the variational conditional fairness bottleneck. The solid line represents the data density p_{(S,X)}. The dashed lines represent the encoding density p_{Y|(X,θ)} and the variational marginal density of the representations q_{Y|θ}. The dotted lines represent (a) the generative variational density q_{X|(S,Y,φ)} or (b) the inference variational density q_{T|(S,Y,φ)}. The encoding and the (a) generative or (b) inference parameters (i.e., θ and φ, respectively) are jointly learned.

We note that the minimization of J_CPF(P_{Y|X}, γ) and J_CFB(P_{Y|X}, β), by means of γ and β, trades off the level of compression of the representations with the information they keep, respectively, about X and T that is not shared by S.

We consider the minimization of J_CPF(P_{Y|X}, γ) and J_CFB(P_{Y|X}, β) to solve the CPF and the CFB problems. Furthermore, we assume that the probability density function p_{Y|X} that describes the conditional probability distribution P_{Y|X} exists and is parameterized by θ, i.e., p_{Y|X} = p_{Y|(X,θ)}. (If 𝒴 is countable, the probability density function is the probability mass function.) The Markov chains of the CPF and the CFB establish that the variables' joint densities factor as p_{(S,X,Y)|θ} = p_{(S,X)} p_{Y|(X,θ)} and p_{(S,T,X,Y)|θ} = p_{(S,T,X)} p_{Y|(X,θ)}, respectively. The densities p_{(S,X)} and p_{(S,T,X)} can be inferred from the data, and the density p_{Y|(X,θ)} is to be designed.

The term I(X;Y) depends on the density p_{Y|θ}, which is usually intractable. Similarly, the terms I(X;Y|S) and I(T;Y|S) depend on the densities p_{X|(S,Y,θ)} and p_{T|(S,Y,θ)}, respectively, which are also usually intractable. Therefore, an exact optimization of θ would be prohibitively computationally expensive. For this reason, we introduce the variational density approximations q_{Y|θ}, q_{X|(S,Y,φ)}, and q_{T|(S,Y,φ)}, where the generative and inference densities are parametrized by φ. This variational approximation leads to the graphical models displayed in Figure 2. Then, as previously done in, e.g., [15, 24, 18], we leverage the non-negativity of the Kullback–Leibler divergence [25, Theorems 2.6.3 and 8.6.1] to bound J_CPF(P_{Y|X}, γ) and J_CFB(P_{Y|X}, β) from above. More precisely, we find an upper bound on I(X;Y) and a lower bound on both I(X;Y|S) and I(T;Y|S), i.e.,

$$I(X;Y) = \mathbb{E}_{p_{(X,Y)|\theta}}\!\left[\log\!\left(\frac{p_{Y|(X,\theta)}}{p_{Y|\theta}}\right)\right] \leq \mathbb{E}_{p_X}\!\left[D_{\text{KL}}\!\left(p_{Y|(X,\theta)} \,\middle\|\, q_{Y|\theta}\right)\right], \tag{7}$$

$$I(X;Y|S) = \mathbb{E}_{p_{(S,X,Y)|\theta}}\!\left[\log\!\left(\frac{p_{X|(S,Y,\theta)}}{p_{X|S}}\right)\right] \geq \mathbb{E}_{p_{(S,X,Y)|\theta}}\!\left[\log\!\left(\frac{q_{X|(S,Y,\phi)}}{p_{X|S}}\right)\right], \quad \text{and} \tag{8}$$

$$I(T;Y|S) = \mathbb{E}_{p_{(S,T,Y)|\theta}}\!\left[\log\!\left(\frac{p_{T|(S,Y,\theta)}}{p_{T|S}}\right)\right] \geq \mathbb{E}_{p_{(S,T,Y)|\theta}}\!\left[\log\!\left(\frac{q_{T|(S,Y,\phi)}}{p_{T|S}}\right)\right]. \tag{9}$$
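The slack in these bounds can be made explicit with the following standard decomposition, which is implicit in the text; the analogous statement for (8) and (9) replaces the true decoder density with the variational one.

```latex
% The upper bound (7) overshoots I(X;Y) by exactly the marginal mismatch:
\begin{align*}
  \mathbb{E}_{p_X}\!\left[ D_{\mathrm{KL}}\!\left( p_{Y|(X,\theta)} \,\middle\|\, q_{Y|\theta} \right) \right]
  &= \mathbb{E}_{p_{(X,Y)|\theta}}\!\left[ \log \frac{p_{Y|(X,\theta)}}{p_{Y|\theta}} \right]
   + \mathbb{E}_{p_{Y|\theta}}\!\left[ \log \frac{p_{Y|\theta}}{q_{Y|\theta}} \right] \\
  &= I(X;Y) + D_{\mathrm{KL}}\!\left( p_{Y|\theta} \,\middle\|\, q_{Y|\theta} \right)
   \;\geq\; I(X;Y).
\end{align*}
```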
We can then jointly learn θ and φ through gradient descent. First, we note that the terms E_{p_{(S,X)}}[log p_{X|S}] and E_{p_{(S,T)}}[log p_{T|S}] do not depend on the parametrization. Second, we leverage the reparametrization trick [15], which allows us to compute an unbiased estimate of the gradients. That is, we let p_{Y|X} dY = p_E dE, where E is a random variable and Y = f(X, E; θ) is a deterministic function. Lastly, we estimate p_{(S,X)} and p_{(S,T,X)} with the data's empirical densities. Hence, given a dataset D = {(x^{(i)}, s^{(i)})}_{i=1}^N for the CPF or a dataset D = {(x^{(i)}, s^{(i)}, t^{(i)})}_{i=1}^N for the CFB, we minimize, respectively, the following cost functions:

$$\tilde{J}_{\text{CPF}}(\theta, \phi, \gamma) = \frac{1}{N}\sum_{i=1}^{N}\left[ D_{\text{KL}}\!\left(p_{Y|(X=x^{(i)},\theta)} \,\middle\|\, q_{Y|\theta}\right) - \gamma\,\mathbb{E}_{p_E}\!\left[\log q_{X|(S=s^{(i)},\,Y=f(x^{(i)},E;\theta),\,\phi)}\!\left(x^{(i)}\right)\right]\right] \tag{10}$$

$$\tilde{J}_{\text{CFB}}(\theta, \phi, \beta) = \frac{1}{N}\sum_{i=1}^{N}\left[ D_{\text{KL}}\!\left(p_{Y|(X=x^{(i)},\theta)} \,\middle\|\, q_{Y|\theta}\right) - \beta\,\mathbb{E}_{p_E}\!\left[\log q_{T|(S=s^{(i)},\,Y=f(x^{(i)},E;\theta),\,\phi)}\!\left(t^{(i)}\right)\right]\right]. \tag{11}$$

In practice, the expectation over E ∈ ℰ is estimated with a naive Monte Carlo estimate from a single sample; i.e., the expectation of g(E): ℰ → ℝ is estimated as E_{p_E}[g(E)] ≈ (1/L) Σ_{l=1}^L g(ε^{(l)}), where ε^{(l)} ~ p_E and L = 1.

An a posteriori interpretation of this approach is that if the encoder compresses the representations Y assuming that the decoder will use both Y and the private or sensitive data S, then the encoder will discard the information about S contained in the original data X in order to generate Y.
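As a concrete illustration of (11), the following is a minimal PyTorch sketch of the empirical CFB cost for a binary task T, combining the single-sample Monte Carlo estimate with the closed-form Gaussian KL divergence used later in the experiments. The names encoder, decoder, and log_sigma are our own placeholders rather than the paper's; the CPF cost (10) is obtained by replacing the task log-likelihood with a reconstruction log-likelihood of x.

```python
import torch
import torch.nn.functional as F

def cfb_loss(encoder, decoder, log_sigma, x, s, t, beta):
    """Single-sample Monte Carlo estimate of the cost (11) for a binary task T.

    encoder: network mapping x to the mean of p_{Y|(X,theta)}.
    decoder: network mapping the concatenation [y, s] to a logit for T.
    log_sigma: learnable scalar, the (input-independent) log of sigma_theta.
    """
    mu = encoder(x)                          # mean of the Gaussian encoder
    sigma = log_sigma.exp()
    y = mu + sigma * torch.randn_like(mu)    # reparametrization trick, L = 1
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), averaged over the batch.
    kl = 0.5 * (mu.pow(2) + sigma.pow(2) - 2.0 * log_sigma - 1.0).sum(dim=1).mean()
    # The decoder receives both the representation Y and the sensitive data S.
    logits = decoder(torch.cat([y, s], dim=1)).squeeze(1)
    log_lik = -F.binary_cross_entropy_with_logits(logits, t.float())
    return kl - beta * log_lik
```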
Remark 1. Note that the resulting cost functions for the CPF and the CFB resemble those of the VAE [15], the β-VAE [16], the VIB [24], or the nonlinear IB [18]. Let us consider the (common) case that the decoder density is estimated with a neural network. If such a network is modified so that it receives as input both the representations and the private or sensitive data, instead of only the representations, then the optimization of these algorithms results in private and/or fair representations (see Appendix B for the details).

If the secret information S is the identity of the samples or their membership to a certain group, the field of differential privacy provides a theoretical framework for defining privacy and several mechanisms able to generate privacy-preserving queries about the data X and to explore such data; see, e.g., [26]. If, on the other hand, the secret information is arbitrary, the theoretical framework introduced in [1] is commonly adopted, with the special case of the privacy funnel [2] when the utility and the privacy are measured with the mutual information.

The original greedy algorithm to compute the PF [2] assumes the data is discrete or categorical and does not scale. For this reason, approaches that take advantage of the scalability of deep learning have emerged. For instance, in [9] the representations are learned through adversarial learning, while the privacy preserving variational autoencoder (PPVAE) [17] and the unsupervised version of the variational fair autoencoder (VFAE) [27] learn such representations with variational inference. At their core, the PPVAE and the unsupervised VFAE end up minimizing the cost functions

$$J_{\text{PPVAE}}(\theta, \phi, \eta) = \mathbb{E}_{p_{(S,X)|\theta}}\!\left[D_{\text{KL}}\!\left(p_{Y|(S,X,\theta)} \,\middle\|\, q_{Y|\theta}\right)\right] - \eta^{-1}\,\mathbb{E}_{p_{(S,X,Y)|\theta}}\!\left[\log q_{X|(S,Y,\phi)}\right] \tag{12}$$

and
J_VFAE(θ, φ, δ) = J_PPVAE(θ, φ, 1) + δ J_MMD(θ, φ), where J_MMD(θ, φ) is a maximum-mean discrepancy term. Even though the resulting function to optimize is similar to ours, it is important to note that the encoding density in these works is p_{Y|(S,X,θ)}, which does not respect the problem's Markov chain S ↔ X → Y. Therefore, the optimization search space includes representations Y that contain information about the private data S that is not even contained in the original data X. Moreover, the private data S might not be available during inference.

The field of algorithmic fairness is mainly dominated by the notions of individual fairness, where the sensitive data S is the identity of the data samples, and group fairness, where S is a binary variable that represents the membership of the data samples to a certain group. There are several approaches that aim at producing classifiers that ensure either of these notions of fairness; e.g., discrimination-free naive Bayes [11], constrained logistic regression, hinge loss, and support vector machines [5], or regularized logistic regression through the Wasserstein distance [28].

Among the representation learning approaches, the disentanglement method of [36] learns two representations, Y_sens and Y_non-sens, that contain the information about the sensitive data and the original data, respectively, necessary to draw inferences about the task T. At inference time, the sensitive representations Y_sens are corrupted with noise or discarded, and thus the non-sensitive representations Y_non-sens from [36] serve a similar purpose to the representations Y obtained with our approach. Compared to the variational fair autoencoder [27], our encoding density does not require the sensitive information S, which might not be available during inference, thus not breaking the Markov chain S ↔ X → Y.

In this section, we present experiments on two datasets to showcase the performance of the presented variational approach to the privacy and fairness problems. First, we show the performance of the proposed method on a dataset commonly used for benchmarking both tasks. Second, we show the performance on high-dimensional data with a toy dataset especially designed for this purpose. The encoder density is modeled with an isotropic Gaussian distribution, i.e., p_{Y|(X,θ)} = N(Y; μ_enc(X;θ), σ_θ² I_d), so that Y = μ_enc(X;θ) + σ_θ N(0, I_d), where μ_enc is a neural network and d is the dimensionality of the representations. The marginal density of the representations is also modeled as an isotropic Gaussian, q_{Y|θ} = N(Y; 0, I_d). Finally, the decoder density, q_{X|(S,Y,φ)} or q_{T|(S,Y,φ)}, is modeled with a product of categoricals (for discrete data) and/or isotropic Gaussians (for continuous data), e.g., q_{X|(S,Y,φ)} = Cat(X_1; ρ_dec(Y, S; φ)) N(X_2; μ_dec(S, Y; φ), σ_φ²) if X consists of a discrete variable X_1 and a continuous variable X_2. The experiments are detailed in the supplementary material D.
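For concreteness, a sketch of these densities in PyTorch follows. The class names, the single hidden layer of 100 units (taken from the experimental details in the supplementary material D.2), and the split into one categorical and one Gaussian factor are our own illustrative choices under the stated assumptions, not a verbatim description of the authors' code.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """p_{Y|(X,theta)} = N(Y; mu_enc(X), sigma_theta^2 I_d), sigma_theta input-independent."""
    def __init__(self, x_dim, y_dim, hidden=100):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.log_sigma = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        mu = self.mu(x)
        return mu + self.log_sigma.exp() * torch.randn_like(mu)

class MixedDecoder(nn.Module):
    """q_{X|(S,Y,phi)}: a categorical factor for X_1 and a Gaussian factor for X_2."""
    def __init__(self, y_dim, s_dim, n_classes, cont_dim, hidden=100):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(y_dim + s_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, n_classes)          # categorical parameters rho_dec
        self.mean = nn.Linear(hidden, cont_dim)             # Gaussian mean mu_dec
        self.log_var = nn.Parameter(torch.zeros(cont_dim))  # sigma_phi, a free parameter

    def forward(self, s, y):
        h = self.body(torch.cat([y, s], dim=1))
        return self.logits(h), self.mean(h), self.log_var
```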
Adult dataset. The Adult dataset contains 48,842 samples from the 1994 U.S. Census (available at the UCI machine learning repository [19]). Each sample comprises 15 features such as, e.g., gender, age, income level (a binary variable stating if the income level is higher or lower than $50,000), or education level. For both tasks, we followed the experimental set-up from Zemel et al. [8] and considered S to be the gender and X to be the rest of the features. For the fairness task, we considered T to be the income level.
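A hedged sketch of this set-up is below. We load the data from scikit-learn's OpenML mirror, which is one of several equivalent sources; the one-hot preprocessing and the exact category labels ("Male", ">50K") are assumptions about that mirror rather than details given in the paper.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
s = (adult.data["sex"] == "Male").astype(int).to_numpy()    # private/sensitive data S
t = (adult.target == ">50K").astype(int).to_numpy()         # task output T: income level
x = pd.get_dummies(adult.data.drop(columns=["sex"])).to_numpy(dtype=float)  # the rest: X
```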
Toy dataset: Colored MNIST. The MNIST dataset [20] is a collection of 70,000 grayscale 28 × 28 images of hand-written digits from 0 to 9. The colored MNIST is a modification of the former dataset where each digit is randomly colored in either red, green, or blue. In both tasks we considered X to be the 28 × 28 × 3 digit images and S to be the color of the digit. Then, for the fairness task we considered T to be the digit number.

In the privacy task, our proposed variational approach is able to control the trade-off between private and informative representations for both the Adult and the Colored MNIST datasets. We minimized (10) for a range of values γ ≥ 1, thus controlling the trade-off between the compression level I(X;Y) and the informativeness of the representations independent of the private data, I(X;Y|S) (or, equivalently, −H(X|S,Y)), as shown in Figures 3a and 3b. Therefore, as suggested by Proposition 1 and shown in Figures 4a and 4b, we also control the amount of information the representations keep about the private data, I(S;Y), which was estimated with MINE [37].

As an illustrative example, we constructed a representation of the same dimensionality as the hand-written digits, i.e., 28 × 28 × 3, by minimizing (10) and setting γ = 1.7. This representation is both informative and private, as shown in Figures 5b and 5d. In Figure 5d, the 2-dimensional UMAP [38] vectors of the representations are mingled with respect to their color, as opposed to the UMAP vectors of the original images, where the vectors are clustered by the color of their images (see Figure 5c).
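The coloring procedure is only loosely specified ("randomly colored in either red, green, or blue"), so the following torchvision sketch is one compatible construction of the toy dataset, with our own helper name colorize:

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

def colorize(img):
    """Map a 1x28x28 grayscale digit to a 3x28x28 image with one active channel."""
    color = torch.randint(0, 3, (1,)).item()   # private data S: 0=red, 1=green, 2=blue
    out = torch.zeros(3, 28, 28)
    out[color] = img[0]
    return out, color

x, s = colorize(mnist[0][0])   # X: the colored image, S: its color
t = mnist[0][1]                # T: the digit number (fairness task)
```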
Figure 3: Trade-off between the representations' compression I(X;Y) and the non-private information retained I(X;Y|S) for the (a) Adult and (b) Colored MNIST datasets with γ ≥ 1. Since I(X;Y|S) = H(X|S) − H(X|S,Y) and H(X|S) does not depend on Y, the reported quantity is −H(X|S,Y). Moreover, trade-off between the compression of the representations I(X;Y) and the non-sensitive information retained about the task I(T;Y|S) for the (c) Adult and (d) Colored MNIST datasets with β ≥ 1. The dots are the computed empirical values and the lines are the moving averages of the 1D linear interpolations of such points.

Figure 4: Behavior of the private information I(S;Y) kept by the private representations in the (a) Adult and (b) Colored MNIST datasets with γ ≥ 1; and by the fair representations in the (c) Adult and (d) Colored MNIST datasets with β ≥ 1. The dots are the computed empirical values and the lines are the moving averages of the 1D linear interpolations of such points.

In the fairness task, our proposed variational approach is also able to control the trade-off between fair and accurate representations. We minimized (11) for a range of values β ≥ 1, thus controlling the trade-off between the compression level I(X;Y) and the predictability without the sensitive variable, I(T;Y|S), as shown in Figures 3c and 3d. Moreover, as suggested by Proposition 2 and shown in Figures 4c and 4d, we also control the amount of information that the representations retain about the sensitive data, I(S;Y), which was estimated with MINE [37].

Furthermore, in the Adult dataset, the Lagrange multiplier β allows us to control the behavior of different utility and group fairness indicators (defined in the supplementary material E), namely the accuracy, the error gap, and the discrimination. That is, the higher the value of β, the higher the accuracy and the discrimination, and the lower the error gap (Figures 6a, 6b, and 6d).

Figure 5: Colored MNIST (a) original data, (b) private representations, (c) original data UMAP dimensionality reduction, and (d) private representations UMAP dimensionality reduction. In the UMAP dimensionality reduction, each vector point is colored with the same color the digit was. Results obtained for γ = 1.7.
Figure 6: Behavior of (a) the accuracy on T, (b) the discrimination, (c) the equalized odds gap, (d) the error gap, and (e) the accuracy on S of a logistic regression (LR, in black) and a random forest (RF, in gray) on the fair representations (dots and dotted line) and the original data (dashed line) learned with β ≥ 1 on the Adult dataset. The dashed red line is the accuracy of a prior-based classifier.

The behavior of the discrimination is enforced by the minimization of I(S;Y), as discussed in the supplementary material E. However, there is no clear indication of the effect of β on the accuracy of adversarial predictors on the sensitive data (which is still below the prior probability of the biased training dataset) and on the equalized odds (Figures 6e and 6c).

In this article, we studied the problem of mitigating the leakage of private or sensitive information S into data-driven systems through the training data X. We formalized the trade-off between the relevant information for the system that is not shared by the private or sensitive data S and the remaining information as a constrained optimization problem. When the task of the system is known, the problem is referred to as learning fair representations and the formalization is the conditional fairness bottleneck (CFB); when the task is unknown, the problem is referred to as learning private representations and the formalization is the conditional privacy funnel (CPF).

We proposed a variational approach, based on the Lagrangians of the CPF and the CFB, to solve the problem. This approach leads to a simple structure where the tasks of learning private and/or fair representations can be easily identified. Moreover, in practice, private and fair representations can be learned with little modification to the implementation of common algorithms such as the VAE [15], the β-VAE [16], the VIB [24], or the nonlinear IB [18]; namely, modifying the decoder neural network so it receives both the representation Y and the private or sensitive data S as an input. Then, the learned representations can be fed to any algorithm of choice. For this reason, the effort for reducing unfair decisions and privacy breaches will be small for many practitioners.

The CPF and the CFB as defined in (1) and (2) are non-convex optimization problems with respect to P_{Y|X}. More specifically, they are a minimization of a convex function with non-convex constraints (see Appendix C). Therefore, (i) the optimal conditional distribution P*_{Y|X} that minimizes the Lagrangian might not be achieved through gradient descent, and (ii) even if P*_{Y|X} is achieved, it could be a sub-optimal value for (1) or (2), since the problems are not strongly dual [23, Section 5.2.3]. A possible solution could be the application of a monotonically increasing concave function u to I(X;Y|S) or I(T;Y|S) in the CPF or CFB Lagrangians, respectively, so that u(I(X;Y|S)) or u(I(T;Y|S)) is concave (and hence the Lagrangian is convex) in the domain of interest. For some u, this approach might allow attaining the desired r in (1) or (2) with a specific value of the Lagrange multiplier; see [39] for an example of this approach for the information bottleneck.
Proposed approach. The proposed approach entails two limitations that are common in variational attempts at solving an optimization problem. Namely: (i) it approximates the decoding and the marginal distributions, and (ii) it considers parametrized densities. The first issue restricts the search space of the possible encoding distributions P_{Y|X} to those distributions whose decoding and marginal distributions follow the restrictions of the variational approximation. The second issue further limits the search space to the encoding distributions obtainable with densities p_{Y|(X,θ)} with a parametrization θ. For this reason, richer encoding distributions and marginals, e.g., by means of normalizing flows [40, 41], are a possible direction to mitigate these issues.

Broader impact

Privacy breaches and unfairness are concerning problems that often arise in (learning) algorithms [42, 43]. Moreover, aside from the realm of algorithms, these are problems that humans, as a society, aim to mitigate in our objective to reach sustainability [44, Goal 10]. With the present paper, we contribute towards the development of privacy-preserving and fair systems, thus contributing to a sustainable development. For example, fair decision-making algorithms could directly contribute to the UN's targets 10.2 and 10.3 [44], namely, "empower and promote the social, economic, and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion, or economic or other status" and "ensure equal opportunity and reduce inequalities of outcome, including by eliminating discriminatory laws, policies, and practices and promoting appropriate legislation, policies, and action in this regard", respectively.

Even though our contribution will not help to directly level out social, political, and economic inequalities, the results and algorithms provided in this paper will help to avoid that inequalities are amplified and prolonged through data-driven services and decision mechanisms (e.g., for insurances, administration, banking, or loans). By treating the data entered into such services and systems fairly and confidentially, as enforced by the proposed approach of this paper, our contribution has the potential to empower and promote social and economical inclusion [44, Target 10.2] and ensure equal opportunity [44, Target 10.3] in this specific domain of people's everyday life. Furthermore, we believe that our results are likely to be adopted by algorithm designers and practitioners, as our solutions can easily be integrated into existing standard representation learning algorithms, as noted in Remark 1 and detailed in the supplementary material B.
Disclosure of funding
This work was supported in part by the Swedish Research Council.
References

[1] Flávio du Pin Calmon and Nadia Fawaz. Privacy against statistical inference. In Allerton Conference on Communication, Control, and Computing (Allerton), pages 1401–1408. IEEE, 2012.
[2] Ali Makhdoumi, Salman Salamatian, Nadia Fawaz, and Muriel Médard. From the information bottleneck to the privacy funnel. In IEEE Information Theory Workshop (ITW), pages 501–505. IEEE, 2014.
[3] Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.
[4] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.
[5] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[6] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018.
[7] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.
[8] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning (ICML), pages 325–333, 2013.
[9] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. International Conference on Learning Representations (ICLR), 2016.
[10] Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. Conditional learning of fair representations. International Conference on Learning Representations (ICLR), 2020.
[11] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In IEEE International Conference on Data Mining Workshops (ICDMW), pages 13–18. IEEE, 2009.
[12] Han Zhao and Geoff Gordon. Inherent tradeoffs in learning fair representations. In Advances in Neural Information Processing Systems (NeurIPS), pages 15649–15659, 2019.
[13] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[14] Ian Fischer. The conditional entropy bottleneck. arXiv preprint arXiv:2002.05379, 2020.
[15] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[16] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR), 2017.
[17] Lihao Nan and Dacheng Tao. Variational approach for privacy funnel optimization on continuous data. Journal of Parallel and Distributed Computing, 137:17–25, 2020.
[18] Artemy Kolchinsky, Brendan D. Tracey, and David H. Wolpert. Nonlinear information bottleneck. Entropy, 21(12):1181, 2019.
[19] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
[22] Raymond W. Yeung. A new outlook on Shannon's information measures. IEEE Transactions on Information Theory, 37(3):466–474, 1991.
[23] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[24] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. International Conference on Learning Representations (ICLR), 2016.
[25] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, second edition, 2006.
[26] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[27] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. International Conference on Learning Representations (ICLR), 2016.
[28] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification. Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
[29] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems (NeurIPS), pages 656–666, 2017.
[30] Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems (NeurIPS), pages 4066–4076, 2017.
[31] Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[32] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.
[33] Yixin Wang, Dhanya Sridhar, and David M. Blei. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019.
[34] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Fairness through causal awareness: Learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 349–358, 2019.
[35] Faisal Kamiran and Toon Calders. Classification with no discrimination by preferential sampling. In Proceedings of the 19th Machine Learning Conference of Belgium and The Netherlands, pages 1–6. Citeseer, 2010.
[36] Elliot Creager, David Madras, Joern-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, and Richard Zemel. Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning (ICML), pages 1436–1445, 2019.
[37] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning (ICML), pages 531–540, 2018.
[38] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.
[39] Borja Rodríguez Gálvez, Ragnar Thobaben, and Mikael Skoglund. The convex information bottleneck Lagrangian. Entropy, 22(1):98, 2020.
[40] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[41] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NeurIPS), pages 4743–4751, 2016.
[42] Songül Tolan. Fair and unbiased algorithmic decision making: Current state and future challenges. Technical report, European Commission, Joint Research Centre (JRC), 2018.
[43] Cathy O'Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, 2016.
[44] United Nations (UN). Transforming our world: The 2030 agenda for sustainable development. 2016.
[45] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[46] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, volume 33, pages 5885–5892, 2019.
[47] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning (ICML), pages 2649–2658, 2018.
[48] Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems (NeurIPS), pages 2610–2620, 2018.
[49] William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc., 2016.
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[51] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 3315–3323, 2016.

Supplementary material
A Equivalences of the Lagrangians
In this section of the supplementary material, we show how minimizing the Lagrangians of the CPF and the CFB problems is equivalent to minimizing other Lagrangians. First, in A.1 we show that it is equivalent to minimizing the Lagrangians that are used in the variational approach we propose in this paper. Then, in A.2 we show that minimizing the Lagrangian of the CPF is equivalent to minimizing the Lagrangian of the PF, meaning that the conditional probability distributions P_{Y|X} obtained using the Lagrangian of the CPF would have been obtained through the Lagrangian of the PF, too.

A.1 Equivalence of the Lagrangians used for the minimization

Proposition 1.
Minimizing L_CPF(P_{Y|X}, λ) is equivalent to minimizing J_CPF(P_{Y|X}, γ), where γ = λ + 1 and

$$J_{\text{CPF}}(P_{Y|X}, \gamma) = I(X;Y) - \gamma I(X;Y|S). \tag{5}$$

Proof.
If we manipulate the expression of the CPF Lagrangian, we can see how minimizing L_CPF(P_{Y|X}, λ) is equivalent to minimizing J_CPF(P_{Y|X}, γ), where γ = λ + 1. More specifically,

$$\arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ \mathcal{L}_{\text{CPF}}(P_{Y|X}, \lambda) \right\} = \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(S;Y) - \lambda I(X;Y|S) \right\} \tag{13}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(X;Y) - I(X;Y|S) - \lambda I(X;Y|S) \right\} \tag{14}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(X;Y) - (\lambda + 1) I(X;Y|S) \right\} \tag{15}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ J_{\text{CPF}}(P_{Y|X}, \lambda + 1) \right\}, \tag{16}$$

where 𝒫 is the set of probability distributions over 𝒴 such that if P_{Y|X} ∈ 𝒫 for all X ∈ 𝒳, then the Markov chain S ↔ X → Y holds. In (14) we used the identity I(S;Y) = I(X;Y) − I(X;Y|S), which follows from the chain rule and I(S;Y|X) = 0 under the Markov chain. ∎

Proposition 2.
Minimizing L_CFB(P_{Y|X}, λ) is equivalent to minimizing J_CFB(P_{Y|X}, β), where β = λ + 1 and

$$J_{\text{CFB}}(P_{Y|X}, \beta) = I(X;Y) - \beta I(T;Y|S). \tag{6}$$

Proof.
If we manipulate the expression of the CFB Lagrangian, we can see how minimizing L_CFB(P_{Y|X}, λ) is equivalent to minimizing J_CFB(P_{Y|X}, β), where β = λ + 1. More specifically,

$$\arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ \mathcal{L}_{\text{CFB}}(P_{Y|X}, \lambda) \right\} = \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(S;Y) + I(X;Y|S,T) - \lambda I(T;Y|S) \right\} \tag{17}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(X;Y) - I(T;Y|S) - \lambda I(T;Y|S) \right\} \tag{18}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(X;Y) - (\lambda + 1) I(T;Y|S) \right\} \tag{19}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ J_{\text{CFB}}(P_{Y|X}, \lambda + 1) \right\}, \tag{20}$$

where 𝒫 is the set of probability distributions over 𝒴 such that if P_{Y|X} ∈ 𝒫 for all X ∈ 𝒳, then the Markov chains S ↔ X → Y and T ↔ X → Y hold. In (18) we used the chain-rule decomposition I(X;Y) = I(S;Y) + I(T;Y|S) + I(X;Y|S,T), which holds under these Markov chains. ∎

A.2 Equivalence of the Lagrangians of the privacy funnel and the CPF
The privacy funnel is defined in a similar way to the CPF. It is an optimization problem that tries to design an encoding probability distribution P_{Y|X} such that the representation Y keeps a certain level r′ of information about the data of interest X, while minimizing the information it keeps about the private data S [2]. That is,

$$\arg\inf_{P_{Y|X}} \left\{ I(S;Y) \right\} \quad \text{s.t.} \quad I(X;Y) \geq r'. \tag{21}$$

Therefore, the Lagrangian of the privacy funnel problem is

$$\mathcal{L}_{\text{PF}}(P_{Y|X}, \alpha) = I(S;Y) - \alpha I(X;Y), \tag{22}$$

where α ∈ [0, 1) is the Lagrange multiplier of L_PF(P_{Y|X}, α). This multiplier controls the trade-off between the information the representations keep about the private and the original data.
Proposition 3.
Minimizing L_CPF(P_{Y|X}, λ) is equivalent to minimizing L_PF(P_{Y|X}, α), where α = λ/(λ + 1).

Proof. If we manipulate the expression of the CPF Lagrangian, we can see how minimizing L_CPF(P_{Y|X}, λ) is equivalent to minimizing L_PF(P_{Y|X}, α), where α = λ/(λ + 1). More specifically,

$$\arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ \mathcal{L}_{\text{CPF}}(P_{Y|X}, \lambda) \right\} = \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(S;Y) - \lambda I(X;Y|S) \right\} \tag{23}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ I(S;Y) - \lambda \left( I(X;Y) - I(S;Y) \right) \right\} \tag{24}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ (\lambda + 1) I(S;Y) - \lambda I(X;Y) \right\} \tag{25}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ (\lambda + 1) \left( I(S;Y) - \frac{\lambda}{\lambda + 1} I(X;Y) \right) \right\} \tag{26}$$

$$= \arg\inf_{P_{Y|X} \in \mathcal{P}} \left\{ \mathcal{L}_{\text{PF}}\!\left( P_{Y|X}, \frac{\lambda}{\lambda + 1} \right) \right\}, \tag{27}$$

where 𝒫 is the set of probability distributions over 𝒴 such that if P_{Y|X} ∈ 𝒫 for all X ∈ 𝒳, then the Markov chain S ↔ X → Y holds. ∎

We note how the relationship α = λ/(λ + 1) maintains α ∈ [0, 1) for λ ≥ 0. This showcases how the CPF poses a more restrictive problem, in the sense that as long as λ < ∞ there are no solutions of the problem that filter private information arbitrarily. (If α = 1, then L_PF(P_{Y|X}, 1) = −I(X;Y|S), for which optimal values of the encoding distribution P_{Y|X} can filter private information arbitrarily; if α > 1, this problem is even more pronounced. For α ≤ 0, trivial encoding distributions like a degenerate distribution P_{Y|X} with density p_{Y|X} = δ(Y) are minimizers of the Lagrangian.)

B Modification of common algorithms to obtain private and/or fair representations
In this section of the supplementary material, we discuss the simple changes needed in common representation learning algorithms to implement our proposed variational approach. First, we show how common unsupervised learning algorithms can be modified to implement the variational approach to the CPF, thus generating private representations. Then, we show how common supervised learning algorithms can be modified to implement the variational approach to the CFB, thus generating fair representations.
Common unsupervised learning algorithms.
The cost function of the β-VAE [16] and the VIB [24] (when the target variable is the identity of the samples) is

$$\mathcal{F}_{\text{uns}}(\theta, \phi, \eta) = \frac{1}{N}\sum_{i=1}^{N}\left[ D_{\text{KL}}\!\left(p_{Y|(X=x^{(i)},\theta)} \,\middle\|\, q_{Y|\theta}\right) - \eta^{-1}\,\mathbb{E}_{p_E}\!\left[\log q_{X|(Y=f(x^{(i)},E;\theta),\,\phi)}\!\left(x^{(i)}\right)\right]\right], \tag{28}$$

where η is a parameter that controls the trade-off between the compression of the representations Y and their ability to reconstruct the original data X. Similarly, the VAE [15] cost function is F_uns(θ, φ, 1).

Comparison with the CPF.
If we compare (28) with the cost function of the CPF, ˜J_CPF(θ, φ, γ), we observe that the only difference (provided that η^{-1} = γ) is the decoding density. In the CPF the decoding density of the original data X depends both on the representations Y and on the private data S, while in (28) it only depends on the representations Y. Therefore, the cost function F_uns(θ, φ, η) is recovered from the cost function of the CPF in the case that q_{X|(S,Y,φ)} = q_{X|(Y,φ)}. However, this is not desirable, since it means that the representations Y contain all the information from the private data S necessary to reconstruct X.

Modifications to obtain private representations.
In these unsupervised learning algorithms [15, 16, 24], the decoding (or generative) density is parametrized with neural networks, e.g., q_{X|(Y,φ)} = Cat(X; ρ_dec(Y;φ)) if X is discrete and q_{X|(Y,φ)} = N(X; μ_dec(Y;φ), σ_dec(Y;φ) I_{d_dec}) if X is continuous, where ρ_dec, μ_dec, and σ_dec are neural networks and d_dec is the dimensionality of X. In this work, the decoding density can also be parametrized with neural networks, e.g., Cat(X; ρ′_dec(S, Y; φ)) if X is discrete and N(X; μ′_dec(S, Y; φ), σ′_dec(S, Y; φ) I_{d_dec}) if X is continuous, where ρ′_dec, μ′_dec, and σ′_dec are neural networks. Therefore, if the decoding density neural networks from [15, 16, 24] are modified so that they take the private data S as an input, then the resulting algorithm is the one proposed in this paper.
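In code, the change amounts to widening the decoder's first layer and concatenating S to the representation. The sketch below uses illustrative dimensions of our own choosing and is not the authors' implementation:

```python
import torch
import torch.nn as nn

d_y, d_s, d_x, hidden = 2, 1, 108, 100   # illustrative dimensions only

# q_{X|(Y,phi)}: the standard decoder of the VAE / beta-VAE / VIB.
standard_decoder = nn.Sequential(
    nn.Linear(d_y, hidden), nn.ReLU(), nn.Linear(hidden, d_x))

# q_{X|(S,Y,phi)}: the proposed decoder also receives the private data S.
conditional_decoder = nn.Sequential(
    nn.Linear(d_y + d_s, hidden), nn.ReLU(), nn.Linear(hidden, d_x))

y = torch.randn(32, d_y)
s = torch.randint(0, 2, (32, d_s)).float()
x_params = conditional_decoder(torch.cat([y, s], dim=1))
```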
Common supervised learning algorithms. The cost function of the VIB [24] and the nonlinear IB [18] is

$$\mathcal{F}_{\text{sup}}(\theta, \phi, \eta) = \frac{1}{N}\sum_{i=1}^{N}\left[ D_{\text{KL}}\!\left(p_{Y|(X=x^{(i)},\theta)} \,\middle\|\, q_{Y|\theta}\right) - \eta^{-1}\,\mathbb{E}_{p_E}\!\left[\log q_{T|(Y=f(x^{(i)},E;\theta),\,\phi)}\!\left(t^{(i)}\right)\right]\right], \tag{29}$$

where η is a parameter that controls the trade-off between the compression of the representations Y and their ability to draw inferences about the task T.
Comparison with the CFB. Similarly to the comparison of the unsupervised learning algorithms and the CPF, we observe that (29) only differs from the cost function ˜J_CFB(θ, φ, β) in the decoding density, i.e., the cost function F_sup(θ, φ, η) can be recovered from ˜J_CFB(θ, φ, β) by setting q_{T|(S,Y,φ)} = q_{T|(Y,φ)}. The inference density of the task T only depends on the representations Y in [24, 18], while in our work it depends both on the representations Y and the sensitive data S. Hence, in these works the representations contain all the information from the sensitive data S necessary to draw inferences about the task T.
Modifications to obtain fair representations. The argument is analogous to the one for the modifications of unsupervised learning algorithms to obtain private representations. The only modification required in these supervised learning algorithms [24, 18] is to modify the decoding density neural networks to receive the sensitive data S as an input as well as the representations Y.
Invariants of the algorithms. In all these works [15, 16, 24, 18] and ours, the first (or compression) term is usually calculated assuming that the encoder density is parametrized with neural networks, e.g., p_{Y|(X,θ)} = N(Y; μ_enc(X;θ), σ_enc(X;θ) I_d), which allows the representations to be constructed using the reparametrization trick, e.g., Y = μ_enc(X;θ) + σ_enc(X;θ) E, where E ~ N(0, I_d), d is the dimensionality of the representations, and I_d is the d-dimensional identity matrix. Then, the marginal density of the representations is set so that the Kullback–Leibler divergence has either a closed expression, a simple way to estimate it, or a simple upper bound, e.g., q_{Y|θ} = N(Y; 0, I_d) or q_{Y|θ} = (1/N) Σ_{i=1}^N p_{Y|(X=x^{(i)},θ)}, where the x^{(i)} are the input data samples. Moreover, the loss function applied to the output of the decoding density and the optimization algorithm, e.g., stochastic gradient descent or Adam [45], can remain the same in these works and ours, too.
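The two marginal choices mentioned above admit short implementations. The sketch below shows the closed-form KL divergence to N(0, I) and a minibatch estimate of the KL divergence to the mixture marginal; the latter is a common estimator in the nonlinear IB line of work and not necessarily the exact one used here, and the helper names are ours.

```python
import math
import torch

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over dimensions."""
    return 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(dim=1)

def kl_to_mixture(mu, log_sigma):
    """Minibatch estimate of KL to q_Y = (1/N) sum_i p_{Y|x^(i)}:
    log p(y_i | x_i) minus a log-mean-exp of all batch members' densities at y_i."""
    n, d = mu.shape
    sigma2 = (2 * log_sigma).exp()
    y = mu + sigma2.sqrt() * torch.randn_like(mu)     # reparametrized samples
    diff = y.unsqueeze(1) - mu.unsqueeze(0)           # (N, N, d) pairwise differences
    log_p = -0.5 * (diff.pow(2) / sigma2 + 2 * log_sigma
                    + math.log(2 * math.pi)).sum(-1)  # pairwise log-Gaussian densities
    log_marginal = torch.logsumexp(log_p, dim=1) - math.log(n)
    return log_p.diagonal() - log_marginal
```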
Remark 2. The aforementioned modifications can also be introduced in other algorithms whose cost functions add further terms to F_uns and F_sup. For example, adding a maximum-mean discrepancy (MMD) term on the representation priors to avoid the information preference problem like in the InfoVAE [46]; adding an MMD term on the encoder densities to enforce privacy or fairness like in the VFAE [27]; or adding a total correlation penalty to the representations' marginal to enforce disentangled representations like in the Factor-VAE, the β-TCVAE, or the FFVAE [47, 48, 36].
In this section of the supplementary material, we show how both the CPF and the CFB as defined in (1) and (2) are non-convex optimization problems.

Lemma 1.
Let X ∈ 𝒳, S ∈ 𝒮, Y ∈ 𝒴, and T ∈ 𝒯 be random variables. Then:

1. If the Markov chain S ↔ X → Y holds and the distributions of X and S are fixed, then I(X;Y), I(S;Y), and I(X;Y|S) are convex functions with respect to the density p_{Y|X}.

2. If, additionally, the Markov chain T ↔ X → Y holds and the distributions of X, S, and T are fixed, then I(X;Y|S,T) and I(T;Y|S) are also convex functions with respect to the density p_{Y|X}.

Proof. We start the proof leveraging [25, Theorem 2.7.4], which, in our setting, tells us that:

• I(X;Y) is a convex function of p_{Y|X} if p_X is fixed.
• I(S;Y) is a convex function of p_{Y|S} if p_S is fixed.
• I(X;Y|S) is a convex function of p_{Y|(X,S)} if p_{X|S} is fixed.
• I(X;Y|S,T) is a convex function of p_{Y|(X,S,T)} if p_{X|(S,T)} is fixed.
• I(T;Y|S) is a convex function of p_{Y|(S,T)} if p_{T|S} is fixed.

Then, since p_{Y|S} = E_{p_{X|S}}[p_{Y|X}], p_{Y|(X,S)} = p_{Y|X} and p_{Y|(X,S,T)} = p_{Y|X} (by the Markov chains), and p_{Y|(S,T)} = E_{p_{X|(S,T)}}[p_{Y|X}], these densities are non-negative weighted sums of p_{Y|X} as defined in [23, Section 2.2.1], and such maps preserve convexity. Hence, I(S;Y), I(X;Y|S), I(X;Y|S,T), and I(T;Y|S) are convex functions of p_{Y|X} if p_S, p_{X|S}, p_{X|(S,T)}, and p_{T|S} are fixed, respectively. ∎
Let us consider that the distributions of S and X are fixed and that the conditional distribution P_{Y|X} has a density p_{Y|X}. Then, the CPF optimization problem is not convex.

Proof. From Lemma 1 we know that I(S;Y) and I(X;Y|S) are convex functions with respect to p_{Y|X} for fixed p_S and p_{X|S}. Hence, the constraint I(X;Y|S) ≥ r defines a super-level set of a convex function, which is in general not a convex set, so the problem is not a convex optimization problem. ∎

Proposition 5.
Let us consider that the distributions of S, T, and X are fixed and that the conditional distribution P_{Y|X} has a density p_{Y|X}. Then, the CFB optimization problem is not convex.

Proof. From Lemma 1 we know that I(S;Y), I(X;Y|S,T), and I(T;Y|S) are convex functions with respect to p_{Y|X} for fixed p_S, p_{X|(S,T)}, and p_{T|S}. Hence, the constraint I(T;Y|S) ≥ r defines a super-level set of a convex function, which is in general not a convex set, so the problem is not a convex optimization problem. ∎

D Details of the experiments
In this section of the supplementary material, we include an additional experiment on the COMPAS dataset [49] and describe the details of the experiments performed to validate the approach proposed in this paper.
D.1 Results on the COMPAS dataset

COMPAS dataset.
The ProPublica COMPAS dataset [49] contains samples of different attributes of criminal defendants, used to classify whether they will recidivate within two years or not. These attributes include gender, age, or race. In both tasks, we followed the experimental set-up from Zhao et al. [10] and considered S to be a binary variable stating if the defendant is African-American and X to be the rest of the attributes. For the fairness task, we considered T to be the binary variable stating if the defendant recidivated or not. Since this dataset was not previously divided between training and test sets, we randomly split the dataset, keeping the larger share of the samples for training and the rest for testing.

As shown in Figure 7, the trade-off between the compression level I(X;Y) and the informativeness of the representations independent of the private data, I(X;Y|S), and between the compression level I(X;Y) and the predictability of the representations without the sensitive data, I(T;Y|S), is controlled by the private and the fair representations, respectively. Moreover, we can also see how the amount of information the representations keep about the private or the sensitive data is commanded by the Lagrange multipliers γ and β.

Furthermore, the Lagrange multiplier β also allows us to control the behavior of the accuracy, the error gap, and the discrimination for the COMPAS dataset (Figures 8a, 8d, and 8b). Moreover, in this scenario, as shown in Figures 8c and 8e, an increase of β also increased the equalized odds level and the accuracy on S of adversarial classifiers (even though they remained below their values obtained with the original data X for all the β tested). These results on the equalized odds, even though not generalizable since we have the counter-example of the Adult dataset, indicate that in some situations this quantity can be controlled with our approach. More specifically, we believe this happens when we can guarantee that I(S;Y;T) is non-negative, as explained in Remark 4.

Figure 7: Trade-off between (a) the private representations' compression I(X;Y) and the non-private information retained I(X;Y|S), and (b) the Lagrange multiplier γ and the private information I(S;Y) kept by the representations. Since I(X;Y|S) = H(X|S) − H(X|S,Y) and H(X|S) does not depend on Y, the reported quantity is −H(X|S,Y). Moreover, trade-off between (c) the fair representations' compression I(X;Y) and the non-sensitive information retained about the task I(T;Y|S), and (d) the Lagrange multiplier β and the sensitive information kept by the representations. All quantities are obtained for the COMPAS dataset. The dots are the computed empirical values and the lines are the moving averages of the 1D linear interpolations of such points.
Figure 8: Behavior of (a) the accuracy on T, (b) the discrimination, (c) the equalized odds gap, (d) the error gap, and (e) the accuracy on S of a logistic regression (LR, in black) and a random forest (RF, in gray) on the fair representations (dots and dotted line) and on the original data (dashed line), learned with β ∈ [1, ] on the COMPAS dataset. The dashed line in red is the accuracy of a prior-based classifier.

D.2 Experimental details

Encoders.
In all the experiments performed, we modeled the encoding density as an isotropic Gaussian distribution, i.e., p_{Y|(X,θ)} = N(Y; µ_enc(X; θ), σ_θ I), so that Y = µ_enc(X; θ) + σ_θ E, where E ∼ N(0, I), µ_enc is a neural network, σ_θ is also optimized via gradient descent but is not calculated with X as an input, and where the representations have a fixed dimension. The neural networks in each experiment were:

• For the Adult dataset, µ_enc was a multi-layer perceptron with a single hidden layer with 100 units and ReLU activations.

Table 1: Convolutional neural network architectures employed for the Colored MNIST dataset. The network modules are the following: Conv2D(cin,cout,ksize,stride,pin,pout) and
ConvTrans2D(cin,cout,ksize,stride,pin,pout) represent, respectively, a 2D convolution and a transposed convolution, where cin is the number of input channels, cout is the number of output channels, ksize is the size of the filters, stride is the stride of the convolution, pin is the input padding of the convolution, and pout is the output padding of the convolution;
MaxPool2D(ksize,stride,pin) represents a max-pooling layer, where the parameters have the same meaning as for the convolutional layers;
Linear(nu) represents a linear layer, where nu is the number of output units; and BatchNorm, ReLU6, Flatten, Unflatten, and
Sigmoid represent a batch normalization, ReLU6, flatten, unflatten, and Sigmoid layer, respectively.

Name Architecture
CNN-enc-1: Conv2D(3,5,5,2,1,0) - BatchNorm - ReLU6 - Conv2D(5,50,5,2,0,0) - BatchNorm - ReLU6 - Flatten - Linear(100) - BatchNorm - ReLU - Linear(2)
CNN-enc-2: Conv2D(3,5,5,0,2,0) - BatchNorm - ReLU6 - Conv2D(3,5,5,0,2,0) - BatchNorm - ReLU6 - Conv2D(3,5,5,0,2,0)
CNN-dec-1: Linear(100) - BatchNorm - ReLU6 - Linear(1250) - Unflatten - BatchNorm - ReLU - ConvTrans2D(50,5,5,2,0,0) - BatchNorm - ReLU - ConvTrans2D(5,3,5,2,1,1) - Sigmoid
CNN-dec-2: Conv2D(3,5,5,0,2) - BatchNorm - ReLU6 - Conv2D(5,50,5,0,2,0) - BatchNorm - ReLU6 - Conv2D(5,50,5,0,2,0) - BatchNorm - ReLU6 - Conv2D(5,50,5,0,2,0) - Sigmoid
CNN-mine: Conv2D(3,5,5,1,1,0) - MaxPool2D(5,2,2) - ReLU6 - Conv2D(5,50,5,1,0,0) - MaxPool2D(5,2,2) - ReLU6 - Flatten - Linear(100) - ReLU6 - Linear(50) - ReLU6 - Linear(1)

• For the Colored MNIST dataset, µ_enc was the convolutional neural network CNN-enc-1 for both the privacy and fairness experiments, and the convolutional neural network
CNN-enc-2 for the example from Figure 5. Both architectures are described in Table 1.

• For the COMPAS dataset, µ_enc was a multi-layer perceptron with a single hidden layer with 100 units and ReLU activations.

Moreover, the marginal density of the representations was modeled as an isotropic Gaussian of unit variance and zero mean; i.e., q_{Y|θ} = N(Y; 0, I).
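To make the encoding model concrete, the following is a minimal PyTorch sketch of such an isotropic Gaussian encoder with the reparameterization trick, using the MLP variant from the Adult and COMPAS experiments. The class name GaussianEncoder and the argument rep_dim (the representation dimension, whose value is not specified here) are our own illustrative choices, not names from the authors' code.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Isotropic Gaussian encoder: Y = mu_enc(X; theta) + sigma_theta * E,
    with E ~ N(0, I) and sigma_theta a free parameter (not a function of X)."""

    def __init__(self, in_dim, rep_dim, hidden=100):
        super().__init__()
        # MLP used for the Adult and COMPAS datasets (single hidden layer, ReLU)
        self.mu_enc = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, rep_dim),
        )
        # log(sigma) is optimized by gradient descent alongside the weights
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        mu = self.mu_enc(x)
        eps = torch.randn_like(mu)               # E ~ N(0, I)
        return mu + self.log_sigma.exp() * eps   # reparameterization trick
```

The reparameterization keeps the sampling step differentiable, so the encoder can be trained jointly with the decoder through the Lagrangian objective.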
Decoders.

In all the experiments performed for the fairness problem, the target task variable T was binary. Hence, we modeled the inference density with a Bernoulli distribution; i.e., q_{T|(S,X,φ)} = Bernoulli(T; ρ_dec(S, Y; φ)), where ρ_dec is a neural network with a Sigmoid activation function at the output (note that the Bernoulli distribution is a categorical distribution with two possible outcomes); a minimal sketch of this decoder is given after the list below. In the privacy problem, if X was a collection of random variables (X_1, X_2, ..., X_C), the generative density was modeled as a product of C categorical and isotropic Gaussian densities, depending on whether the variables were discrete or continuous. That is,

$$q_{X|(S,Y,\varphi)} = \prod_{j=1}^{C} \mathrm{Cat}\bigl(X_j; \rho_{\mathrm{dec},j}(S,Y;\varphi)\bigr)^{\mathbb{I}[X_j\ \text{discrete}]} \, \mathcal{N}\bigl(X_j; \mu_{\mathrm{dec},j}(S,Y;\varphi), \sigma_{\varphi,j}\bigr)^{\mathbb{I}[X_j\ \text{continuous}]},$$

where the continuous random variables are one-dimensional. In practice, the densities were parametrized with a neural network with $K = \sum_{j=1}^{C} K_j^{\mathbb{I}[X_j\ \text{discrete}]} \, 1^{\mathbb{I}[X_j\ \text{continuous}]}$ output neurons, where K_j is the number of classes of the categorical variable X_j and each group of output neurons defines one of the factors of the product; either as the logits of the K_j classes or as the parameter (mean) of the Gaussian (the variances were also optimized via gradient descent but were not calculated with S nor Y as an input). If X was an image, the generative density was modeled as a product of 3C Bernoulli densities, where C is the number of pixels of the image and the 3 comes from the RGB channels. The neural networks in each experiment were:

• For the Adult dataset, the decoding neural network was a multi-layer perceptron with a single hidden layer with 100 units and ReLU activations. For the fairness task, the output was 1-dimensional with a Sigmoid activation function. For the privacy task, the output was K-dimensional. The input of the network was a concatenation of Y and S.

Table 2: Hyperparameters (Lagrange multiplier β or γ, batch size, etc.) employed to train the encoder and decoder networks for each dataset and task.

• For the Colored MNIST dataset and the fairness task, the decoding neural network was also a multi-layer perceptron with a single hidden layer with 100 units, ReLU activations, and a 1-dimensional output with a Sigmoid activation function. For the privacy task, the decoding neural network was the
CNN-dec-1 for the normal experiments and the
CNN-dec-2 for the example of Figure 5. The input linear layers took as input a concatenation of Y and S, and in the convolutional layers S was introduced as a bias.

• For the COMPAS dataset, the decoding neural network was also a multi-layer perceptron with a single hidden layer with 100 units and ReLU activations. For the fairness task, the output was 1-dimensional with a Sigmoid activation function. For the privacy task, the output was K-dimensional. The input of the network was a concatenation of Y and S.
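As a companion to the decoder description above, here is a minimal PyTorch sketch of the Bernoulli inference model for a binary task variable, assuming a binary S concatenated to Y at the input; the class name BernoulliDecoder is our own.

```python
import torch
import torch.nn as nn

class BernoulliDecoder(nn.Module):
    """Inference model q_{T|(S,Y)} = Bernoulli(T; rho_dec(S, Y)) for binary T."""

    def __init__(self, rep_dim, hidden=100):
        super().__init__()
        self.rho_dec = nn.Sequential(
            nn.Linear(rep_dim + 1, hidden),  # input: concatenation of Y and binary S
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                    # Bernoulli parameter in (0, 1)
        )

    def forward(self, y, s):
        # s: tensor of shape (batch, 1) with values in {0, 1}
        return self.rho_dec(torch.cat([y, s], dim=1))
```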
Hyperparameters.

The hyperparameters employed in the experiments to train the encoder and decoder networks are displayed in Table 2, and the optimization algorithm used was Adam [45].
Preprocessing.
The input data X for the Adult and the COMPAS datasets was normalized to have zero mean and unit variance. The input data for the Colored MNIST dataset was scaled to the range [0, 1].
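As an illustration, this preprocessing can be reproduced with scikit-learn and NumPy; the variable names below (X_train_raw, images_raw, etc.) are placeholders, not names from the authors' code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tabular data (Adult, COMPAS): zero mean and unit variance per feature.
X_train_raw = np.random.rand(500, 10)    # placeholder training data
X_test_raw = np.random.rand(200, 10)     # placeholder test data
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)    # reuse the training statistics

# Image data (Colored MNIST): scale 8-bit pixel intensities to [0, 1].
images_raw = np.random.randint(0, 256, (64, 3, 28, 28))
images = images_raw.astype(np.float32) / 255.0
```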
Information measures.

The mutual information I(X;Y) and the conditional mutual informations I(X;Y|S) and I(T;Y|S) were calculated with the bounds from (7), (8), and (9), respectively. Since H(X|S) was not directly obtainable, −H(X|S,Y) was calculated and displayed instead. The mutual information I(S;Y) was calculated using the mutual information neural estimator (MINE) with a moving-average bias corrector with an exponential rate of 0.1 [37]; the resulting estimate was averaged over the last 100 iterations. The neural networks employed were 2-hidden-layer multi-layer perceptrons with 100 ReLU6 activation functions for all the datasets and tasks, except for the example on the Colored MNIST dataset, where the CNN-mine from Table 1 was used. In all tasks, the input was a concatenation of S and Y, except for the example on the Colored MNIST dataset, where in all convolutional layers the private data S was added as a bias and in all linear layers S was concatenated to the input. The hyperparameters used to train the networks are displayed in Table 3.
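For reference, below is a minimal sketch of one MINE estimation step using the Donsker-Varadhan lower bound with an exponential-moving-average (EMA) bias corrector, in the spirit of [37]. The function name mine_step and its interface are our own; stat_net stands for the statistics network (the MLP or the CNN-mine described above).

```python
import torch

def mine_step(stat_net, s, y, ema, rate=0.1):
    """One estimation step of I(S; Y) with the Donsker-Varadhan bound and an
    exponential-moving-average (EMA) bias corrector for the gradient."""
    joint = stat_net(torch.cat([s, y], dim=1))         # samples from p(S, Y)
    y_perm = y[torch.randperm(y.size(0))]              # shuffle to emulate p(S)p(Y)
    marginal = stat_net(torch.cat([s, y_perm], dim=1))
    exp_marg = marginal.exp().mean()
    mi_estimate = joint.mean() - exp_marg.log()        # DV lower bound on I(S; Y)
    ema = (1 - rate) * ema + rate * exp_marg.detach()  # EMA with exponential rate
    # Replacing the log's denominator by the EMA de-biases the gradient estimate
    loss = -(joint.mean() - exp_marg / ema)
    return mi_estimate.detach(), loss, ema
```

Backpropagating loss trains stat_net (with ema initialized to, e.g., torch.tensor(1.0)), and the I(S;Y) estimate is read from mi_estimate, averaged over the last iterations as described above.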
Group fairness and utility indicators.

The accuracy on T and S, the discrimination, and the error and equalized odds gaps were calculated using both the input data X and the generated representations Y. They were calculated with a logistic regression (LR) classifier and a random forest (RF) classifier with the default settings from scikit-learn [50]. The prior displayed on the accuracy figures for T and S is the accuracy of a classifier that only infers the majority class of T and S, respectively, from the training dataset.
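A minimal sketch of this evaluation with scikit-learn's default classifiers follows; the random arrays are stand-ins for the learned representations and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
Y_train, t_train = rng.normal(size=(500, 2)), rng.integers(0, 2, 500)  # stand-ins
Y_test, t_test = rng.normal(size=(200, 2)), rng.integers(0, 2, 200)

# Default scikit-learn settings, as in the experiments.
for Clf in (LogisticRegression, RandomForestClassifier):
    clf = Clf().fit(Y_train, t_train)
    print(f"{Clf.__name__} accuracy on T: {clf.score(Y_test, t_test):.3f}")
```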
E Group fairness and utility indicators

In this section of the supplementary material, we define and put into perspective a series of metrics, employed in this article, that indicate the prediction quality and the group fairness of a classifier.
Notation
Let X ∈ 𝒳, S ∈ 𝒮, and T ∈ 𝒯 be random variables denoting the input data, the sensitive data, and the target task data, respectively. Let also 𝒳 = R^{d_X}, 𝒮 = {0, 1}, and 𝒯 = {0, 1}. Let w ∈ 𝒲, w : 𝒳 → 𝒯, be a classifier; that is, w receives as an input an instance of the input data x ∈ 𝒳 and outputs an inference about the target task value t ∈ 𝒯 for that input data. Let us also consider the setting where we have a dataset that contains N samples of the random variables, D = {(x^{(i)}, s^{(i)}, t^{(i)})}_{i=1}^{N}. Finally, let P̂ denote the empirical probability distribution on the dataset D, P̂_{S=σ} the empirical probability distribution on the subset of the dataset D where s^{(i)} = σ, i.e., {(x, s, t) ∈ D : s = σ}, and P̂_{(S=σ,T=τ)} the empirical probability distribution on the subset of the dataset D where s^{(i)} = σ and t^{(i)} = τ, i.e., {(x, s, t) ∈ D : s = σ and t = τ}.

A common metric to evaluate the performance (utility) of a classifier w on a dataset is its accuracy, which measures the fraction of correct classifications of the classifier on such a dataset.
Definition 1. The accuracy of a classifier w on a dataset D is

Accuracy(w, D) = P̂(w(X) = T). (30)

An ideally fair classifier w would maintain demographic parity (or statistical parity) and accuracy parity, which, respectively, mean that w(X) ⊥ S (or, equivalently if w is deterministic, that P̂_{S=0}(w(X) = 1) = P̂_{S=1}(w(X) = 1)) and that P̂_{S=0}(w(X) ≠ T) = P̂_{S=1}(w(X) ≠ T) [10]. In other words, if a classifier has demographic parity, it gives a positive outcome at an equal rate to the members of S = 0 and S = 1. However, demographic parity might damage the desired utility of the classifier [51], [12, Corollary 3.3]. Accuracy parity, on the contrary, allows the existence of perfect classifiers [10]. The metrics that assess the deviation of a classifier from demographic and accuracy parity are the discrimination or demographic parity gap [8, 10] and the error gap [10].
Definition 2. The discrimination or demographic parity gap of a classifier w with respect to the sensitive variable S on a dataset D is

Discrimination(w, D) = | P̂_{S=0}(w(X) = 1) − P̂_{S=1}(w(X) = 1) |. (31)
Definition 3. The error gap of a classifier w with respect to the sensitive variable S on a dataset D is

Error gap(w, D) = | P̂_{S=0}(w(X) ≠ T) − P̂_{S=1}(w(X) ≠ T) |. (32)

Another advanced notion of fairness is that of equalized odds or positive rate parity, which means that P̂_{(S=0,T=τ)}(w(X) = 1) = P̂_{(S=1,T=τ)}(w(X) = 1) for all τ ∈ {0, 1} or, equivalently, that w(X) ⊥ S | T [51]. This notion of fairness requires that the true positive and false positive rates of the groups S = 0 and S = 1 are equal. The metric that assesses the deviation of a classifier from equalized odds is the equalized odds gap [10].
Definition 4. The equalized odds gap of a classifier w with respect to the sensitive variable S on a dataset D is

Equalized odds gap(w, D) = max_{τ ∈ {0,1}} | P̂_{(S=0,T=τ)}(w(X) = 1) − P̂_{(S=1,T=τ)}(w(X) = 1) |. (33)
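A straightforward NumPy implementation of Definitions 1-4 could look as follows; the function name fairness_metrics is ours, and the inputs are assumed to be binary arrays of predictions, targets, and sensitive attributes.

```python
import numpy as np

def fairness_metrics(t_pred, t_true, s):
    """Empirical indicators from Definitions 1-4; all inputs are binary arrays."""
    g0, g1 = (s == 0), (s == 1)
    accuracy = (t_pred == t_true).mean()                          # Definition 1
    discrimination = abs(t_pred[g0].mean() - t_pred[g1].mean())   # Definition 2
    error_gap = abs((t_pred[g0] != t_true[g0]).mean()
                    - (t_pred[g1] != t_true[g1]).mean())          # Definition 3
    eq_odds_gap = max(                                            # Definition 4
        abs(t_pred[g0 & (t_true == tau)].mean()
            - t_pred[g1 & (t_true == tau)].mean())
        for tau in (0, 1)
    )
    return accuracy, discrimination, error_gap, eq_odds_gap
```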
Remark 3. In the particular case of learning fair representations, the classifier w : 𝒳 → 𝒯 consists of two stages: an encoder w_enc : 𝒳 → 𝒴 and a decoder w_dec : 𝒴 → 𝒯, where the intermediate variable Y = w_enc(X) is the fair representation of the data. Therefore:

1. Minimizing I(S;Y) encourages demographic parity, since

I(S;Y) = 0 ⟺ Y ⊥ S ⟹ w(X) ⊥ S. (34)
2. Minimizing I(S;Y|T) encourages equalized odds, since

I(S;Y|T) = 0 ⟺ Y ⊥ S | T ⟹ w(X) ⊥ S | T. (35)

Remark 4. Based on Remark 3, we note that the variational approach to the CFB and the CPF for generating private and/or fair representations encourages demographic parity, since the minimization of the Lagrangians of such problems, L_CFB and L_CPF, indeed minimizes I(S;Y).

However, we cannot say the same for equalized odds. Even though I(S;Y) = I(S;Y|T) + I(S;Y;T), since the interaction information I(S;Y;T) can be negative [22], I(S;Y) is not necessarily greater than I(S;Y|T), and thus there is no guarantee that minimizing I(S;Y) will also minimize I(S;Y|T).
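To see concretely that the interaction information can be negative, consider the standard XOR construction (our illustration; [22] discusses the general phenomenon):

```latex
% Let S and T be independent fair bits and let Y = S \oplus T (XOR). Then
\begin{align*}
  I(S;Y)        &= 0 \quad \text{($Y$ is uniform and independent of $S$),} \\
  I(S;Y \mid T) &= H(S \mid T) - H(S \mid Y, T) = 1 - 0 = 1 \text{ bit,} \\
  I(S;Y;T)      &= I(S;Y) - I(S;Y \mid T) = -1 \text{ bit.}
\end{align*}
```

In this case, minimizing I(S;Y) alone imposes no penalty on the representation even though it completely violates equalized odds, which is exactly the failure mode Remark 4 warns about.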