Oblivious Data for Fairness with Kernels
Steffen Grünewälder S.GRUNEWALDER@LANCASTER.AC.UK
Azadeh Khaleghi A.KHALEGHI@LANCASTER.AC.UK
Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
Abstract
We investigate the problem of algorithmic fairness in the case where sensitive and non-sensitive features are available and one aims to generate new, 'oblivious', features that closely approximate the non-sensitive features, and are only minimally dependent on the sensitive ones. We study this question in the context of kernel methods. We analyze a relaxed version of the Maximum Mean Discrepancy criterion which does not guarantee full independence but makes the optimization problem tractable. We derive a closed-form solution for this relaxed optimization problem and complement the result with a study of the dependencies between the newly generated features and the sensitive ones. Our key ingredient for generating such oblivious features is a Hilbert-space-valued conditional expectation, which needs to be estimated from data. We propose a plug-in approach and demonstrate how the estimation errors can be controlled. While our techniques help reduce the bias, we would like to point out that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.
Keywords:
Algorithmic Fairness, Kernel Methods
1. Introduction
Machine learning algorithms trained on historical data may inherit implicit biases which can in turn lead to potentially unfair outcomes for some individuals or minority groups. For instance, gender bias may be present in a historical dataset on which a model is trained to automate the postgraduate admission process at a university. This may in turn render the algorithm biased, leading it to inadvertently generate unfair decisions. In recent years, a large body of work has been dedicated to systematically addressing this problem, whereby various notions of fairness have been considered, see, e.g. (Calders et al., 2009; Zemel et al., 2013; Louizos et al., 2015; Hardt et al., 2016; Joseph et al., 2016; Kilbertus et al., 2017; Kusner et al., 2017; Calmon et al., 2017; Zafar et al., 2017; Kleinberg et al., 2017; Donini et al., 2018; Madras et al., 2018), and references therein. Among the several algorithmic fairness criteria, one important objective is to ensure that a model's prediction is not influenced by the presence of sensitive information in the data. In this paper, we address this objective from the perspective of (fair) representation learning. Thus, a central question which forms the basis of our work is as follows.
Can the observed features be replaced by close approximations that are independent of the sensitive ones?
More formally, assume that we have a dataset such that each data-point is a realization of a random variable (X, S) where S and X are in turn vector-valued random variables corresponding to the sensitive and non-sensitive features respectively. We further allow X and S to be arbitrarily dependent, and ask whether it is possible to generate a new random variable Z which is ideally independent of S and close to X in some meaningful probabilistic sense. As an initial step, we may assume that X is zero-mean, and aim for decorrelation between Z and S. This can be achieved by letting Z = X − E^S X, where E^S X is the conditional expectation of X given S. The random variable Z so defined is not correlated with S and is close to X. In particular, it recovers X if X and S are independent. In fact, under mild assumptions, Z gives the best approximation (in the mean-squared sense) of X, while being uncorrelated with S. Observe that while the distribution of Z differs from that of X, this new random variable seems to serve the purpose well. For instance, if S corresponds to a subject's gender and X to a subject's height, then Z corresponds to the height of the subject centered around the average height of the class corresponding to the subject's gender. The key contributions of this work, briefly summarized below, are theoretical; we also provide an evaluation of the proposed approach through experiments in the context of classification and regression. Before giving an overview of our results, we would also like to point out that while our techniques help reduce the bias, it is important to note that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.
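To make the linear construction above concrete, the following sketch (a minimal illustration with made-up data; the variable names and distributions are ours, not the paper's) checks empirically that Z = X − E^S X is uncorrelated with a binary S while staying close to X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Binary sensitive feature and a non-sensitive feature whose mean depends on S.
S = rng.integers(0, 2, size=n)
X = rng.normal(loc=1.5 * S, scale=1.0, size=n)

# Estimate E[X | S] by group means and form Z = X - E[X | S].
group_means = np.array([X[S == s].mean() for s in (0, 1)])
Z = X - group_means[S]

print(np.corrcoef(X, S)[0, 1])  # clearly non-zero: X and S are correlated
print(np.corrcoef(Z, S)[0, 1])  # approximately zero: Z is decorrelated from S
print(np.mean((X - Z) ** 2))    # Z stays close to X (shifted only by the group mean)
```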
Contributions. Building upon this intuition, and using results inspired by testing for independence using the Maximum Mean Discrepancy (MMD) criterion (see e.g. Gretton et al. (2008)), we obtain a related optimization problem in which X and E^S X are replaced with Hilbert-space-valued random variables and Hilbert-space-valued conditional expectations. While the move to Hilbert spaces does not enforce complete independence between the new features and the sensitive features, it helps to significantly reduce the dependencies between the features. The new features Z have various useful properties which we explore in this paper. They are also easy to generate from samples (X_1, S_1), . . . , (X_n, S_n). The main challenge in generating the oblivious features Z_1, . . . , Z_n is that we do not have access to the Hilbert-space-valued conditional expectation and need to estimate it from data. Since we are concerned with Reproducing Kernel Hilbert Spaces (RKHSs) here, we use the reproducing property to extend the plug-in approach of Grünewälder (2018) to the RKHS setting and tackle the estimation problem. We further show how estimation errors can be controlled. Having obtained the empirical estimates of the conditional expectations, we generate oblivious features and an oblivious kernel matrix to be used as input to any kernel method. This guarantees a significant reduction in the dependence between the predictions and the sensitive features. We cast the objective of finding oblivious features Z which approximate the original features X well while maintaining minimal dependence on the sensitive features S as a constrained optimization problem. Making use of Hilbert-space-valued conditional expectations, we provide a closed-form solution to the optimization problem proposed. Specifically, we first prove that our solution satisfies the constraint of the optimization problem at hand, and show via Proposition 4 that it is indeed optimal. Through Proposition 2 we relate the strength of the dependencies between Z and S to how close Z lies to the low-dimensional manifold corresponding to the image of X under the feature map φ. This result is key in providing some insight into the interplay between probabilistic independence and approximations in the Hilbert space. We extend known estimators for real-valued conditional expectations to estimate those taking values in a Hilbert space, and show via Proposition 5 how to control their estimation errors. This result in itself may be of independent interest in future research concerning Hilbert-space-valued conditional expectations. We provide a method to generate oblivious features and the oblivious kernel matrix which can be used instead of the kernel matrix to
reduce the dependence of the prediction on the sensitive features; the computational complexity of the approach is O(n^2).
1. Our implementations are available at https://github.com/azalk/Oblivious.git.
Related Work.
Among the vast literature on algorithmic fairness, Donini et al. (2018); Madraset al. (2018), which fit into the larger body of work on fair representation learning, are closest toour approach. Madras et al. (2018) describe a general framework for fair representation learning.The approach taken is inspired by generative adversarial networks and is based on a game playedbetween generative models and adversarial evaluations. Depending on which function classes oneconsiders for the generative models and for the adversarial evaluations one can describe a vast arrayof approaches. Interestingly, it is possible to interpret our approach in this general context: theencoder f corresponds to a map from X and S to H , where our new features Z live. We do not havea decoder but compare features directly (one could also take our decoder to be the identity map).Our adversary is different from that used by Madras et al. (2018). In their approach a regressor isinferred which maps the features to the sensitive features, while we compare sensitive features andnew features by applying test functions to them. The regression approach performs well in theircontext because they only consider finitely many sensitive features. In the more general frameworkconsidered in the present paper where the sensitive features are allowed to take on continuous values,this approach would be sub-optimal since it cannot capture all dependencies. Finally, we ignorelabels when inferring new features. It is also worth pointing out that our approach is not based ona game played between generative models and an adversary but we provide closed form solutions.On other hand, while the focus of Donini et al. (2018) is mostly on empirical risk minimizationunder fairness constraints, the authors briefly discuss representation learning for fairness as well. Inparticular, Equation (13) in the reference paper effectively describes a conditional expectation inHilbert space, though it is not denoted or motivated as such. The conditional expectation is based onthe binary features S only and the construction is applied in the linear kernel context to derive newfeatures. The authors do not go beyond the linear case for representation learning but there is a clearlink to the more general notions of conditional expectation on which we base our work. We discussthe relation to Donini et al. (2018) in detail in Section 6.5 and we show how their approach can beextended beyond binary sensitive features by making use of our conditional expectation estimates. Organization.
The rest of the paper is organized as follows. In Section 2 we introduce our notation and provide preliminary definitions used in the paper. Our problem formulation and optimization objective are stated in Section 3. As part of the formulation we also define the notion of H-independence between Hilbert-space-valued features and the sensitive features. In Section 4 we study the relation between H-independence and bounds on the dependencies between oblivious and sensitive features. In Section 5 we provide a solution to the optimization objective. In Section 6 we derive an estimator for the conditional expectation and use it to generate oblivious features and the oblivious kernel matrix. We provide some empirical evaluations in Section 7.
2. Preliminaries
In this section we introduce some notation and basic definitions. Consider a probability space (Ω, A, P). For any A ∈ A we let χ_A : Ω → {0, 1} be the indicator function such that χ_A(ω) = 1 if, and only if, ω ∈ A. Let X be a measurable space in which a random variable X : Ω → X takes values. We denote by σ(X) the σ-algebra generated by X. Let H be an RKHS composed of functions h : X → R and denote its feature map by φ : X → H, where φ(x) = k(x, ·) for some positive definite kernel k : X × X → R. As follows from the reproducing kernel property of H we have ⟨φ(x), h⟩ = h(x) for all h ∈ H. Moreover, observe that φ(X) is in turn a random variable attaining values in H. In Appendix A we provide some technical details concerning Hilbert-space-valued random variables such as φ(X).
Conditional Expectation. Let S : Ω → S be a random variable taking values in a measurable space S. For the random variable X defined above, we denote by E^S X the random variable corresponding to Kolmogorov's conditional expectation of X given S, i.e. E^S X = E(X | σ(S)), see, e.g. Shiryaev (1989). Recall that in the special case where S = {0, 1} we simply have E^S X = E(X | S = 0) χ{S = 0} + E(X | S = 1) χ{S = 1}, where E(X | S = i) is the familiar conditional expectation of X given the event {S = i} for i = 0, 1. Thus, in this case, the random variable E^S X is equal to E(X | S = 0) if S attains the value 0 and is equal to E(X | S = 1) otherwise. Note that the above example is for illustration only, and that X and S may be arbitrary random variables: they are not required to be binary or discrete-valued. Unless otherwise stated, in this paper we use Kolmogorov's notion of conditional expectation. We will also be concerned with conditional expectations that attain values in a Hilbert space H, which mostly behave like real-valued conditional expectations (see Pisier (2016) and Appendix B for details). Next, we introduce Hilbert-space-valued L_2-spaces which play a prominent role in our results.
Hilbert-space-valued L_2-spaces. For a Hilbert space H, we denote by 𝓛_2(H) = 𝓛_2(Ω, A, P; H) the H-valued 𝓛_2 space. If H is an RKHS with a bounded and measurable kernel function then φ(X) is an element of 𝓛_2(Ω, A, P; H). The space 𝓛_2(Ω, A, P; H) consists of all (Bochner-)measurable functions φ(X) from Ω to H such that E(‖φ(X)‖^2) < ∞ (see Appendix A for more details). We call these functions random variables or Hilbert-space-valued random variables and denote them with bold capital letters. As in the scalar case we have a corresponding space of equivalence classes which we denote by L_2(Ω, A, P; H). For φ(X), Y ∈ 𝓛_2(Ω, A, P; H) we use φ(X)•, Y• for the corresponding equivalence classes in L_2(Ω, A, P; H). The space L_2(Ω, A, P; H) is itself a Hilbert space with norm and inner product given by ‖φ(X)•‖_2^2 = E(‖φ(X)‖^2) and ⟨φ(X)•, Y•⟩_2 = E(⟨φ(X), Y⟩), where we use a subscript 2 to distinguish this norm and inner product from the ones from H. The norm and inner product have a corresponding pseudo-norm and bilinear form acting on 𝓛_2(H) and we also denote these by ‖·‖_2 and ⟨·,·⟩_2.
3. Problem Formulation
We formulate the problem as follows. Given two random variables X : Ω → X and S : Ω → S corresponding to non-sensitive and sensitive features in a dataset, we wish to devise a random variable Z : Ω → X which is independent of S and closely approximates X in the sense that for all Z′ : Ω → X we have,
‖Z − X‖_2 ≤ ‖Z′ − X‖_2. (1)
Dependencies between random variables can be very subtle and difficult to detect. Similarly, completely removing the dependence of X on S without changing X drastically is an intricate task that is rife with difficulties. Thus, we aim for a more tractable objective, described below, which still gives us control over the dependencies. We start by a strategic shift from probabilistic concepts to interactions between functions and random variables. Consider the RKHS H of functions h : X → R with feature map φ as introduced
Figure 1: (a) The three main random variables in Problem 1 are shown. The non-sensitive features X attain values in X and are mapped onto the RKHS H through the feature map φ; the sensitive features S attain values in S, and Z attains values in H. All three random variables are defined on the same probability space (Ω, A, P). (b) The image of X under φ is sketched (blue curve). This is a subset of H whose projection onto the subspace spanned by two orthonormal basis elements e_1 and e_2 is shown here. The set φ[X] is a low-dimensional manifold if φ is continuous. The element h* = E(φ(X)) lies in the convex hull of φ[X]. Intuitively, if Z attains values mainly in the gray shaded area then Z is only weakly dependent on S.
in Section 2, and assume that H is large enough to allow for the approximation of arbitrary indicator functions χ{Z ∈ A′} in the L_2-pseudo-norm for any X-valued random variable Z. Observe that if
E(h(Z) × g(S)) = E(h(Z)) · E(g(S)) (2)
for all h ∈ H, g ∈ L_2, then Z and S are, indeed, independent. This is because h and g can be used to approximate arbitrary indicator functions, which together with (2) gives,
P({Z ∈ A′} ∩ {S ∈ B′}) ≈ E(h(Z) × g(S)) = E(h(Z)) · E(g(S)) ≈ P(Z ∈ A′) · P(S ∈ B′).
This means that the independence constraint of the optimization problem of (1) translates to (2). Note that using RKHS elements as test functions is a common approach for detecting dependencies and is used in the MMD-criterion (e.g. Gretton et al. (2008)). On the other hand, due to the reproducing property of the kernel of H, we can also rewrite the constraint (2) as
E(⟨h, φ(Z)⟩ × g(S)) = E⟨h, φ(Z)⟩ · E(g(S)). (3)
Observe that φ(Z) is a random variable that attains values in a low-dimensional manifold; if the kernel function is continuous and X = R^d then the image φ[X] of X under φ is a d-dimensional manifold which we denote in the following by M. In Figure 1 this manifold is visualized as the blue curve. Therefore, while Equation (3) is linear in φ(Z), depending on the shape of the manifold, it can lead to an arbitrarily complex optimization problem. We propose to relax (3) by moving away from the manifold, replacing φ(Z) with a random variable Z : Ω → H which potentially has all of H as its range. This simplifies the original optimization problem to one over a vector space under a linear constraint. To formalize the problem, we rely on a notion of H-independence introduced below.
Definition 1 (H-Independence) We say that Z ∈ 𝓛_2(Ω, A, P; H) and S : Ω → S are H-independent if and only if for all h ∈ H and all bounded measurable g : S → R it holds that,
E(⟨h, Z⟩ × g(S)) = E⟨h, Z⟩ × E(g(S)).
Thus, instead of solving for Z : Ω → X in (1), we seek a solution to the following optimization problem.
Problem 1 Find Z ∈ 𝓛_2(Ω, A, P; H) that is H-independent from S (in the sense of Definition 1) and is close to X in the sense that ‖Z − φ(X)‖_2 ≤ ‖Z′ − φ(X)‖_2 for all Z′ which are also H-independent of S.
Observe that the H-independence constraint imposed by Problem 1 ensures that all non-linear predictions based on Z are uncorrelated with the sensitive features S. The setting is summarized in Figure 1(a).
Projection onto M. If Z lies in the image of φ and H is a 'large' RKHS then H-independence also implies complete independence between the estimator ⟨ĥ, Z⟩ and S. To see this, assume that there exists a random variable W : Ω → X such that Z = φ(W) and that the RKHS is characteristic. Since for any f ∈ H and bounded measurable g : S → R,
E(f(W) × g(S)) = E(⟨f, Z⟩ × g(S)) = E⟨f, Z⟩ · E(g(S)) = E(f(W)) · E(g(S)),
we can deduce that W and S are independent. Moreover, since Z is a function of W it is also independent of S. In general, Z will not be representable as some φ(W) and there can be dependencies between ⟨ĥ, Z⟩ and S. However, if Z attains values close to the manifold M then we can find a random variable W such that φ(W) is close to Z and the dependence between φ(W) and S is controlled by how close Z is to the manifold. Showing that a suitable W exists is not trivial; the difficulty is that for values that Z might attain in H there can be many points on the manifold closest to that value, and selecting points on the manifold in a way that makes the random variable W well defined needs a result on measurable selections. The following proposition makes use of such a selection and guarantees the existence of a suitable W, i.e. it states that there exists a random variable W such that φ(W) achieves the minimal distance to Z + h*.
Proposition 1 Consider Z ∈ 𝓛_2(Ω, A, P; H), assume that the kernel function is continuous and (strictly) positive-definite, and X is compact. For any h* ∈ H there exists a σ(Z)-measurable random variable W which attains values in X such that φ(W) ∈ 𝓛_2(Ω, A, P; H) and ‖Z + h* − φ(W)‖_2 = d(Z + h*, M).
Proof Proof is provided in Appendix C.1.
We will call such a variable W provided by the proposition a projection of Z on M. The variable W can be approximated algorithmically for a given Z and h* (see Appendix E.3). Furthermore, φ(W) is a good approximation of φ(X) whenever Z is, as
‖φ(W(ω)) − φ(X(ω))‖ ≤ ‖φ(W(ω)) − Z(ω)‖ + ‖Z(ω) − φ(X(ω))‖ ≤ 2‖Z(ω) − φ(X(ω))‖,
where we used that φ(W(ω)) is closest to Z(ω) on M = φ[X]. Therefore, ‖φ(W) − φ(X)‖_2 ≤ 2‖Z − φ(X)‖_2.
4. Bounding the dependencies
A common approach to quantifying the dependence between random variables is to consider |P(A ∩ B) − P(A)P(B)|, where A and B run over suitable families of events. In our setting, these families are the σ-algebras σ(Z) (or, alternatively, σ(W)) and σ(S), and the difference between P(A ∩ B) and P(A)P(B), A ∈ σ(Z) or A ∈ σ(W), B ∈ σ(S), quantifies the dependence between the random variables Z and S, and W and S, respectively. Upper bounds on the absolute difference of these two quantities are related to the notion of α-dependence which underlies α-mixing. In time-series analysis mixing conditions like α-mixing play a significant role since they provide means to control temporal dependencies (see, e.g., (Bradley, 2007; Doukhan, 1994)). The aim of this section is to show how the notion of H-independence is related to the dependence between the random variables. In particular, Proposition 2 below states a bound on the dependence between W and S in terms of the distance of Z to the manifold M. More exactly, we allow Z to be translated by h* ∈ H before measuring the distance. This is important because the manifold itself can lie away from the origin while the Z we construct in Section 5 lies around the origin. The distance we consider is
d(Z + h*, M) = E(inf_{h∈M} ‖Z + h* − h‖),
the average Hilbert space distance between Z + h* and the manifold. Observe that the expectation on the right side is well defined when M is compact since we can then replace M with a countable dense subset of M. Furthermore, if Z and φ(W) are closely coupled in the sense that there exists a constant c such that for any event A ∈ σ(Z) there exists an event A′ ∈ σ(W) fulfilling P(A △ A′) ≤ c, then the dependence between Z and S can also be bounded. For the bound to be useful we want a small value of c for which the above holds, e.g. if we let c = 1 then the above holds trivially but the bound we provide below becomes vacuous. In this context, observe that W, as constructed above, is a function of Z and we know that σ(W) ⊂ σ(Z). However, the opposite inclusion is not guaranteed to hold. Coming back to bounding the dependence between W and S: the high level idea is that H-independence would correspond to normal independence if we had function evaluations 'h(Z)' instead of inner products ⟨h, Z⟩ (given that H is sufficient to approximate indicator functions). While generally there is no such expression for the inner product, we know that for φ(W) we actually have the equivalence ⟨h, φ(W)⟩ = h(W) due to the reproducing property of the kernel function. In contrast to Z, the random variable φ(W) does not need to be H-independent of S; however, if Z + h* and φ(W) are not too far from each other in ‖·‖_2-norm then φ(W) will be approximately H-independent of S and we can say something about the dependence between W and S. Therefore, the bound below is stated in terms of ‖Z + h* − φ(W)‖_2, which is equal to the distance between Z + h* and M, and a measure of how well indicator functions can be approximated. More specifically, the bound is controlled by the functional
ψ(A) = inf_{f∈H} ‖χ_A(W) − f(W)‖_2 + ‖f‖ d(Z + h*, M), (4)
where A ∈ {W[C] : C ∈ σ(W)} and f has to balance between approximating the indicator function while keeping ‖f‖ d(Z + h*, M) small.
The function ψ has a natural interpretation as the minimal error that can be achieved in a regularized interpolation problem. If H lies dense in a certain space, then any relevant indicator can in principle be approximated arbitrarily well. This is not saying that ψ(A) will be small, since the norm of the element that approximates the indicator might be large. But the approximation error, which is ‖χ_A(W) − f(W)‖_2, can be made arbitrarily small. With this notation in place the proposition is as follows.
Proposition 2 Consider a Z ∈ 𝓛_2(Ω, A, P; H) which is H-independent from S, suppose that the kernel function is continuous and (strictly) positive-definite, and X is compact. Let W be a projection of Z on M. For any A ∈ σ(W) and B ∈ σ(S), with A′ = W[A] being the image of A under W, the following holds,
|P(A ∩ B) − P(A)P(B)| ≤ ψ(A′).
Furthermore, for A ∈ σ(Z), if c > 0 is such that B_A = {W[C] : C ∈ σ(W), P(C △ A) ≤ c} is non-empty, then for any B ∈ σ(S),
|P(A ∩ B) − P(A)P(B)| ≤ c + inf_{D∈B_A} ψ(D).
Proof Proof is provided in Appendix C.2.
Intuitively, as visualized in Figure 1, the proposition states that if Z mostly attains values in the gray area then the dependence between W and S is low and, if W and Z are strongly coupled, then the dependence between Z and S is also low.
ψ(A). The key quantity in Proposition 2 is ψ(A). To control ψ(A) it is necessary to control how well the RKHS can approximate indicators and to estimate the distance d(Z + h*, M). The former problem is more difficult and might be approached using the theory of interpolation spaces; we do not try to develop the necessary theory here but only mention a simple result on denseness at the end of this section. On the other hand, the latter problem is easy to deal with: the distance d(Z + h*, M) between Z + h* and M can be estimated efficiently. In the case where the space X is compact and φ is a continuous function, we propose an empirical estimate of d(Z + h*, M) given by
d_n(Z + h*, M) := (1/n) Σ_{i=1}^n min_{h∈M} ‖Z_i + h* − h‖, (5)
where Z_i, i ≤ n, n ∈ N, are n independent copies of Z. Note that the compactness of X together with the continuity of φ make the min operator in (5) well-defined. Proposition 3
Consider a Z ∈ L (Ω , A , P ; H ) which is H -independent from S , suppose thatthe kernel function is continuous and (strictly) positive-definite, and X is compact. Let ρ =max x ∈ X (cid:107) φ ( x ) (cid:107) < ∞ . For any h ∗ ∈ H with (cid:107) h ∗ (cid:107) ≤ ρ and every (cid:15) > we have, Pr( | d n ( Z + h ∗ , M ) − d ( Z + h ∗ , M ) | ≥ (cid:15) ) ≤ (cid:16) − n(cid:15) ρ (cid:17) . Proof
Proof Proof is provided in Appendix C.3.
Coming back to the approximation error ‖χ_A(W) − f(W)‖_2, where A ⊂ X is the image under W of some set C ∈ σ(W) and f ∈ H, we would like to mention the following: let ν = P W^{-1} be the push-forward measure of P under W. If H lies dense in L_2(X, B, ν) then for any such A and any ε > 0 there exists a function f such that ‖χ_A(W) − f(W)‖_2 < ε, i.e. for the measurable set A there exists a function f ∈ H such that
∫ (χ_A(W) − f(W))^2 dP = ∫ (χ_A(x) − f(x))^2 dν(x) < ε^2,
using (Fremlin, 2001, Theorem 235Gb). In many cases the continuous functions C(X) lie dense in L_2(X, B, ν) and a universal RKHS H is sufficient to approximate such indicators (see Sriperumbudur et al. (2011)).
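As a concrete illustration of the estimate (5), the sketch below approximates d_n(Z + h*, M) by a grid search over the manifold for an RBF kernel on a compact interval. It assumes, purely for illustration, that each Z_i + h* is given as a finite kernel expansion Σ_j a_{ij} φ(c_j); the expansion points, coefficients and kernel bandwidth are made up.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
centres = rng.uniform(-1, 1, size=20)            # expansion points c_j (illustrative)
A = rng.normal(size=(50, 20)) / 20               # coefficients a_ij for n = 50 copies of Z + h*

grid = np.linspace(-1, 1, 400)                   # candidate points x on the compact set X
K_cc = rbf(centres[:, None], centres[None, :])   # Gram matrix of the expansion points
K_cg = rbf(centres[:, None], grid[None, :])      # cross-kernel between expansion points and grid

# ||sum_j a_j phi(c_j) - phi(x)||^2 = a^T K_cc a + k(x, x) - 2 a^T k(c, x);  k(x, x) = 1 for RBF.
sq_norms = np.einsum('ij,jk,ik->i', A, K_cc, A)
dists_sq = sq_norms[:, None] + 1.0 - 2.0 * A @ K_cg
d_n = np.sqrt(np.clip(dists_sq, 0.0, None)).min(axis=1).mean()   # Eq. (5), min taken over the grid
print(d_n)
```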
5. Best H-independent features
In this section we discuss how to obtain Z as a closed-form solution to Problem 1. To this end, inspired by the sub-problem in the linear case, we obtain Z using Hilbert-space-valued conditional expectations. We further show that these features are H-independent of S and that Z is the best H-independent approximation of φ(X). In the linear case discussed in the Introduction it turned out that Z = X − E^S X + EX is a good candidate for the new features Z. In the Hilbert-space-valued case a similar result holds. The main difference here is that we do have to work with Hilbert-space-valued conditional expectations. For any random variable φ(X) ∈ 𝓛_2(Ω, A, P; H), and any σ-subalgebra B of A, the conditional expectation E^B φ(X) is defined and is again an element of 𝓛_2(Ω, A, P; H). We are particularly interested in conditioning with respect to the sensitive random variable S. In this case, B is chosen as σ(S), the smallest σ-subalgebra which makes S measurable, and we denote this conditional expectation by E^S φ(X). In the following, we write φ(X) for the H-valued random variable ω ↦ φ(X(ω)). A natural choice for the new features is
Z = φ(X) − E^S φ(X) + E(φ(X)). (6)
The expectation E(φ(X)) is to be interpreted as the Bochner-integral of φ(X) given measure P. Importantly, if S and φ(X) are independent, we have with this choice that Z = φ(X) and we are back to the standard kernel setting. Also, if φ(X) ∈ 𝓛_2(Ω, A, P; H) then so is Z. We can verify that the features Z are, in fact, H-independent of S. In particular, for any h ∈ H and g ∈ L_2,
E(⟨φ(X) − E^S φ(X), h⟩ × g(S)) = ⟨E(φ(X) × g(S)) − E((E^S φ(X)) × g(S)), h⟩ = ⟨E(φ(X) × g(S)) − E(E^S(φ(X) × g(S))), h⟩ = 0.
Since E(φ(X)) is a constant this implies that E(⟨Z, h⟩ × g(S)) = E(h(X)) · E(g(S)). A similar argument shows that E⟨Z, h⟩ = E(h(X)). Thus, Z is H-independent of S.
In Figure 2 the effect of the move from φ(X) to Z is visualized. In the figure S is plotted against h_1(X) and h_2(X) (blue dots), where h_1 corresponds to the quadratic function and h_2 to the sine function. The dependencies between h_1(X) and S, as well as h_2(X) and S, are high and there is a clear trend in the data. The two red curves correspond to the best regression functions, using S to predict h_1(X) and h_2(X). The relation between the new features and S is shown in the other two plots (gray dots). In the case of h_1 one can observe that the dependence between ⟨h_1, Z⟩ and S is much smaller and, by the design of Z, ⟨h_1, Z⟩ and S are uncorrelated.
Figure 2: The figure shows data from two different settings. In the left two plots X = S + U, where S and U are independent, S is uniformly distributed, and U is uniformly distributed on [−1/2, 1/2]. The function h_1 is the quadratic function. The leftmost plot shows h_1(X) against S and the plot to its right shows a centered version of ⟨h_1, Z⟩ plotted against S. Similarly for the right plots, with the difference that S is uniformly distributed on [0, π] and U is uniformly distributed on [0, π/2]. The function h_2(x) is sin(x). The red curves show the best regression curve, predicting h_1(X) and h_2(X) using S.
Similarly for ⟨h_2, Z⟩, where the dependence on S seems to be even lower and it is difficult to visually verify any remaining dependence between S and ⟨h_2, Z⟩. An interesting aspect of this transformation from X to Z is that Z is automatically uncorrelated with S for all functions h in the corresponding RKHS, without the need to ever explicitly consider a particular h. Besides being H-independent of S, these new features Z also closely approximate our original features φ(X) if the influence from S is not too strong, i.e. the mean squared distance is
E(‖φ(X) − Z‖^2) = E(‖E^S φ(X) − E(φ(X))‖^2),
which is equal to zero if X is independent of S. In fact, Z is the best approximation of φ(X) in the mean squared sense under the H-independence constraint. This is essentially a property of the conditional expectation, which corresponds to an orthogonal projection in L_2(Ω, A, P; H). We summarize this property in the following result.
Proposition 4 Given φ(X), Z′ ∈ 𝓛_2(Ω, A, P; H) such that Z′ is H-independent of S, then
E(‖φ(X) − Z′‖^2) ≥ E(‖φ(X) − Z‖^2),
where Z = φ(X) − E^S φ(X) + E(φ(X)). Furthermore, Z is the unique minimizer (up to almost sure equivalence).
Proof provided in Appendix C.4.
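The following sketch illustrates the construction of Proposition 4 and its H-independence numerically, using an explicit two-dimensional feature map (the features of a quadratic kernel) so that E^S φ(X) can be computed by group means; the data-generating choices are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Discrete sensitive feature and a dependent non-sensitive feature.
S = rng.integers(0, 3, n)
X = rng.normal(loc=S, scale=1.0, size=n)

# Explicit finite-dimensional feature map: phi(x) = (x, x^2), the features of a quadratic kernel.
Phi = np.column_stack([X, X ** 2])

# Z = phi(X) - E[phi(X) | S] + E[phi(X)]; the conditional expectation is exact group means here.
group_means = np.vstack([Phi[S == s].mean(axis=0) for s in range(3)])
Z = Phi - group_means[S] + Phi.mean(axis=0)

h = rng.normal(size=2)           # an arbitrary element of the (finite-dimensional) RKHS
g = (S == 2).astype(float)       # a bounded measurable test function of S
hZ = Z @ h
print(np.mean(hZ * g) - np.mean(hZ) * np.mean(g))             # ~ 0: H-independence holds
print(np.mean((Phi @ h) * g) - np.mean(Phi @ h) * np.mean(g))  # clearly non-zero for phi(X)
```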
Change in predictions.
When replacing φ(X) by Z we lose information (we reduce the influence of the sensitive features). An interesting question to ask is, 'how much does the reduction in information change our predictions?' A simple way to bound the difference in predictions is as follows. Consider any h ∈ H, for instance corresponding to a regression function; then
|h(X) − ⟨h, Z⟩| ≤ ‖h‖ ‖φ(X) − Z‖ ≤ ‖h‖ ‖E^S φ(X) − E(φ(X))‖,
where ‖E^S φ(X) − E(φ(X))‖ effectively measures the influence of S. Hence, the difference in prediction is upper bounded by the norm of the predictor (here h) and a quantity that measures the dependence between S and φ(X).
Example. To demonstrate that the effect of the move from X to Z can be profound we consider the following fundamental example: suppose that X and S are standard normal random variables with covariance c ∈ [−1, 1] and consider the linear kernel k(x, y) = xy, x, y ∈ R. In this case φ(X) = X and E^S X = cS is also normally distributed (see Bertsekas and Tsitsiklis (2002)[Sec 4.7]). Hence, Z = X − E^S X + E(X) is normally distributed and E(Z × S) = c − cE(S^2) = 0. This implies that Z and S are, in fact, fully independent, regardless of how large the dependence between the original features X and the sensitive features S may be. In the case where X and S are fully dependent, i.e. X = aS for some a ∈ R, the features Z are equal to zero and do not approximate X. Next, consider a polynomial kernel of second order such that the quadratic function h(x) = x^2 lies within the corresponding RKHS. The inner product between this h and Z is equal to X^2 − E^S X^2 + E(X^2) and is not independent of S. Hence, the kernel function affects the dependence between Z and S. Also, within the same RKHS there lie linear functions, and for any linear function h′ it holds that ⟨Z, h′⟩ is independent of S. Therefore, within the same RKHS we can have directions in which Z is independent of S and directions where both variables are dependent.
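A quick numerical check of this example (pure simulation; the covariance value is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.8                                    # covariance of the standard normals X and S (arbitrary)
S = rng.normal(size=500_000)
X = c * S + np.sqrt(1 - c ** 2) * rng.normal(size=S.size)

# Linear kernel: Z = X - E[X|S] + E[X] with E[X|S] = cS and E[X] = 0.
Z = X - c * S
print(np.corrcoef(Z, S)[0, 1], np.corrcoef(Z ** 2, S ** 2)[0, 1])  # both ~ 0: full independence

# Quadratic function h(x) = x^2: the corresponding component is X^2 - E[X^2|S] + E[X^2],
# where E[X^2|S] = c^2 S^2 + (1 - c^2).  It is uncorrelated with functions of S ...
A = X ** 2 - (c ** 2 * S ** 2 + (1 - c ** 2)) + 1.0
print(np.corrcoef(A, S ** 2)[0, 1])        # ~ 0 (uncorrelatedness is guaranteed by construction)
# ... but not independent of S: its conditional spread still varies with S.
print(np.corrcoef(np.abs(A), np.abs(S))[0, 1])   # clearly non-zero
```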
6. Generating oblivious features from data
To be able to generate the features Z we need to first estimate the conditional expectation E^S φ(X) from data. To this end, we devise a plug-in approach. After introducing this approach in Section 6.1 we discuss how the estimation errors of the plug-in estimator can be controlled in Section 6.2. In Section 6.3 we show how the oblivious features can be generated. Finally, in Section 6.4, we demonstrate how the approach can be applied to statistical problems, and we discuss relations to the approach of Donini et al. (2018) in Section 6.5.
A common method for estimation is the plug-in approach whereby an unknown probability measure is replaced by the empirical measure. This approach is used in Grünewälder (2018) for deriving estimators of conditional expectations. To see how the approach can be generalized to our setting, first observe that we can write
E^S φ(X) = g ◦ S almost surely, (7)
where g : S → H is a Bochner-measurable function (see Appendix A and Lemma 2 for details). Our aim is to estimate this function g from i.i.d. observations {(X_i, S_i)}_{i≤n}. For any subset B of the range space S of the sensitive features define the empirical measure P_n(S ∈ B) = (1/n) Σ_{i=1}^n δ_{S_i}(B), where δ_{S_i} is the Dirac measure with mass one at location S_i. We define an estimate of the conditional expectation of φ(X) given that the sensitive variable falls into a set B by
E_n(φ(X) | S ∈ B) = (1/(n P_n(S ∈ B))) Σ_{i=1}^n φ(X_i) × δ_{S_i}(B),
when P_n(S ∈ B) > 0, and through E_n(φ(X) | S ∈ B) = 0 otherwise. Observe that for h ∈ H we have,
⟨h, (1/(n P_n(S ∈ B))) Σ_{i=1}^n φ(X_i) × δ_{S_i}(B)⟩ = (1/(n P_n(S ∈ B))) Σ_{i=1}^n h(X_i) × δ_{S_i}(B).
We can also write this as ⟨h, E_n(φ(X) | S ∈ B)⟩ = E_n(h(X) | S ∈ B). An estimate of the conditional expectation given S is provided by
E^S_n φ(X) = Σ_{B ∈ ℘_S} E_n(φ(X) | S ∈ B) × χ{S ∈ B},
where ℘_S is a finite partition of the range space S of S. A common choice for ℘_S if S is the hypercube [0, 1]^d, d ≥ 1, are the dyadic sets. Observe that we can move inner products inside the conditional expectation E^S_n φ(X) so that ⟨h, E^S_n φ(X)⟩ = E^S_n h(X), where E^S_n h(X) is the empirical conditional expectation introduced in Grünewälder (2018).
The estimation error when estimating E^S φ(X) using E^S_n φ(X) is relatively easy to control thanks to the plug-in approach. Essentially, standard results concerning the empirical measure carry over to conditional expectation estimates in the real-valued case (Grünewälder, 2018). But through scalarization we can transfer some of these results straight away to the Hilbert-space-valued case. For instance, applying this to φ(X),
‖E_n(φ(X) | S ∈ B) − E(φ(X) | S ∈ B)‖ = sup_{‖h‖≤1} |⟨E_n(φ(X) | S ∈ B) − E(φ(X) | S ∈ B), h⟩| = sup_{‖h‖≤1} |E_n(h(X) | S ∈ B) − E(h(X) | S ∈ B)|,
and bounds on the latter term are known. Similarly,
‖E^S_n φ(X) − E^S φ(X)‖ = sup_{‖h‖≤1} |E^S_n h(X) − E^S h(X)|. (8)
However, both E^S_n φ(X) and E^S φ(X) are random variables and a useful measure of their difference is the L_2-pseudo-norm. The L_2-pseudo-norm should in this case not be taken with respect to P itself but conditional on the training sample. Hence, for i.i.d. pairs (X, S), (X_1, S_1), . . .
, (X_n, S_n) let F_n = σ(X_1, S_1, . . . , X_n, S_n) and define the 'conditional' L_2-pseudo-norm by
‖E^S_n φ(X) − E^S φ(X)‖^2_{2,n} = E^{F_n} ‖E^S_n φ(X) − E^S φ(X)‖^2.
Substituting Equation (8) in shows that this expression is equal to
E^{F_n}( sup_{‖h‖≤1} |E^S_n h(X) − E^S h(X)|^2 ).
The supremum cannot be taken out of the conditional expectation; however, by writing E^S_n h(X) and E^S h(X) as simple functions (see Appendix A.1) we can get around this difficulty and control the error in ‖·‖_{2,n}. We demonstrate this in the following by deriving rates of convergence for two cases: for the case where S is finite, and for the case where S is the unit cube in R^d for some d ≥ 1 and S has a density that is bounded away from zero.
To derive these rates we rely, among other things, on the convergence of the empirical process uniformly over families of functions related to the unit ball of H and partitions of S. For instance, in the case where S is finite we need to assume that
H_S := {(h ◦ π) × χ(X × {s}) : h ∈ H, ‖h‖ ≤ 1, s ∈ S},
as a family of real-valued functions on X × S, is a P-Donsker class. The function π : X × S → X is here the projection onto the first argument, i.e. π(x, s) = x. For the definition of P-Donsker classes see Dudley (2014); Giné and Nickl (2016).
There are various ways to verify this condition in concrete settings. For example, if H is a finite dimensional RKHS then H_S is a P-Donsker class under a mild measurability assumption. This follows from a few simple arguments: any finite dimensional space of functions is a VC-subgraph class (Giné and Nickl, 2016, Ex.3.6.11); this implies directly that {(h ◦ π) × χ(X × {s}) : h ∈ H, ‖h‖ ≤ 1} is a VC-subgraph class for every s ∈ S. Furthermore, finite unions of VC-subgraph classes are again a VC-subgraph class; under a mild measurability assumption it follows now from Dudley (2014, Cor.6.19) that H_S is a P-Donsker class.
There are obviously other ways to prove this statement. In particular, one might use that the unit ball of H is a universal Donsker class (see Dudley (2014); Giné and Nickl (2016) for details) when the kernel function is continuous and X is compact (this also holds when H is infinite dimensional): due to Marcus (1985) the unit ball of a Hilbert space is a universal Donsker class if sup_{x∈X} |h(x)| ≤ c‖h‖ for some constant c that does not depend on h. If the kernel function is bounded, c = sup_{x∈X} k(x, x)^{1/2} witnesses that this property holds.
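In code, the plug-in estimator E^S_n φ(X) never needs to be formed explicitly: for each cell B of the partition it is a weighted sum of the feature maps φ(X_i), so only the weights are needed. The sketch below computes these weights for a simple interval partition of S = [0, 1]; the partition, data and function names are illustrative choices, not part of the paper.

```python
import numpy as np

def cond_exp_weights(S_train, edges):
    """Weights w[c] such that E_n(phi(X) | S in cell c) = sum_i w[c][i] * phi(X_i),
    for an interval partition of [0, 1] defined by the given cell edges."""
    cells = np.digitize(S_train, edges)              # cell index of each S_i
    n_cells = len(edges) + 1
    weights = np.zeros((n_cells, len(S_train)))
    for c in range(n_cells):
        mask = cells == c
        if mask.any():
            weights[c, mask] = 1.0 / mask.sum()      # (1 / (n P_n(S in B))) * delta_{S_i}(B)
        # empty cells keep weight 0, matching E_n(phi(X) | S in B) = 0
    return weights

S_train = np.random.default_rng(0).uniform(0, 1, size=200)
edges = np.linspace(0, 1, 9)[1:-1]                   # a dyadic-style partition into 8 intervals
w = cond_exp_weights(S_train, edges)
print(w.shape, w.sum(axis=1))                        # non-empty cells have weights summing to 1
```

With these weights, ⟨h, E^S_n φ(X)⟩ evaluated at a point whose S falls in cell c is simply Σ_i w[c][i] h(X_i), mirroring the reproducing-property identity above.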
Case 1: finitely many sensitive features. Our first proposition states that the estimator converges with the optimal rate n^{-1/2} when S is finite and H_S is a P-Donsker class.
Proposition 5 Given a finite space S and a P-Donsker class H_S, it holds that ‖E^S_n φ(X) − E^S φ(X)‖_{2,n} ∈ O*_P(n^{-1/2}).
The proof is given in Appendix C.5.
Case 2: [0, 1]^d-valued sensitive features. We extend Proposition 5 to the case where S is not confined to taking finitely many values. In order to state the result, we introduce the following notation. Set S := [0, 1]^d for some d ∈ N and let g : S → H be such that with probability one E^S φ(X) = g ◦ S (which is possible by Lemma 2). Consider a discretization of S into dyadic cubes ∆_1, ∆_2, . . . , ∆_{ℓ^d} of side-length 1/ℓ for some ℓ ∈ N, and write D_ℓ := {∆_1, . . . , ∆_{ℓ^d}}. Define C_ℓ := {X × ∆ : ∆ ∈ D_ℓ} and let
H_C := {h × χ_D : h ∈ H, ‖h‖ ≤ 1, D ∈ ∪_{ℓ∈N} C_ℓ}.
Proposition 6
Suppose that the push-forward measure µ := P S^{-1} has density u with respect to the Lebesgue measure λ on S with the property that inf_{s∈S} u(s) ≥ b for some b > 0. Assume that g is L-Lipschitz continuous and that H_C is a P-Donsker class. We have ‖E^S_n φ(X) − E^S φ(X)‖_{2,n} ∈ O*_P(n^{-1/(2(d+1))}).
The proof is given in Appendix C.6.
Given a data-point (X, S) composed of non-sensitive and sensitive features X and S respectively, we can generate an oblivious random variable Z as
Z := φ(X) − E^S_n φ(X) + E_n(φ(X)). (9)
Most kernel methods work with the kernel matrix and do not need access to the features themselves. The same holds in our setting. More specifically, we never need to represent Z explicitly in the Hilbert space but only require inner-product calculations. In order to calculate the empirical estimates of the conditional expectation E^S_n φ(X) and of E_n(φ(X)) in (9) we consider a simple approach whereby we split the training set into two subsets of size n, and use half the observations to obtain the empirical estimates of the expectations. The remaining n observations are used to obtain an oblivious predictor; we have two cases as follows.
Case 1 (M-Oblivious). The standard kernel matrix K is calculated with the remaining n observations and a kernel method is applied to K to obtain a predictor g. When applying the predictor to a new unseen data-point (X, S) we first transform X into Z via (9) and calculate the prediction as ⟨g, Z⟩. As discussed in the Introduction, we conjecture that this approach is suitable in the case where the labels Y are conditionally independent of the sensitive features S given the non-sensitive features X, i.e. when S, X, Y form a Markov chain S → X → Y. As such we call this approach M-Oblivious.
Case 2 (Oblivious). Instead of calculating the kernel matrix K, an oblivious kernel matrix, i.e.
O = (O_{ij})_{i,j=1}^n with O_{ij} = ⟨Z_i, Z_j⟩ (so the diagonal entries are ‖Z_i‖^2), (10)
is calculated by applying Equation (9) to the remaining training samples (X_i, S_i) before taking inner products. The oblivious matrix is then passed to the kernel method to obtain a predictor g. The matrix is positive semi-definite since a^T O a = ‖Σ_{i=1}^n a_i Z_i‖^2 ≥ 0 for any a ∈ R^n. The complexity to compute the matrix is O(n^2) (see Appendix E for details on the algorithm). Prediction for a new unseen data-point (X, S) is now done in the same way as in Case 1.
In this section we showcase our approach in the context of kernel ridge regression. We have three relevant random variables, namely the non-sensitive features X, the sensitive features S and the labels Y, which are real valued. We assume that we have 2n i.i.d. observations {(X_i, S_i, Y_i)}_{i≤2n}. We use the observations n + 1, . . . , 2n to generate the oblivious random variables Z_i and then use the oblivious data {(Z_i, Y_i)}_{i≤n} for oblivious ridge regression (ORR). The ORR problem has the following form. Given a positive definite kernel function k : X × X → R, a corresponding RKHS H and oblivious features Z_i, our aim is to find a regression function h ∈ H such that the mean squared error between ⟨h, Z⟩ and Y is small. Replacing the mean squared error by the empirical least-squares error and adding a regularization term for h gives us the optimization problem
ĥ = arg min_{h∈H} Σ_{i=1}^n (⟨h, Z_i⟩ − Y_i)^2 + λ‖h‖^2, (11)
where λ > 0 is the regularization parameter. It is easy to see that the setting is not substantially different from standard kernel ridge regression and to derive a closed-form solution for ĥ. More specifically, we have a representer theorem in this setting which tells us that the minimizer lies in the span of Z_1, . . . , Z_n. One can then solve the optimization problem in the same way as for standard kernel ridge regression, see Appendix D for details. The solution to the optimization problem is ĥ = Σ_{i=1}^n α_i Z_i, where α = (O + λI)^{-1} y. The vector y is given by (Y_1, . . . , Y_n)^T. Predicting Y for a new observation (X, S) is achieved by first generating the oblivious features Z (see Appendix E.2) and then by evaluating ⟨Z, ĥ⟩ = Σ_{i=1}^n α_i ⟨Z, Z_i⟩ (a code sketch illustrating (9)-(11) is given at the end of Section 6.5).
Our focus in this paper is on generating features that are less dependent on the sensitive features than the original non-sensitive features. However, the conditional expectation E^S φ(X), which is at the heart of our approach, also features prominently in methods that add constraints to SVM classifiers. In particular, in Donini et al. (2018) a constraint is used to achieve approximately equal opportunity in classification where the sensitive feature is binary. While their approach does not make explicit use of conditional expectations, one can recognize that the key object in their approach (Eq. (13) in Donini et al. (2018)) is, in fact, closely related to our conditional expectation when used in the case where S can attain only two values (say S = {0, 1}). In detail, the optimization problem (14) is constrained by enforcing for a given ε > 0 that the solution h* ∈ H fulfills
|E_n(h*(X) | S = 0) − E_n(h*(X) | S = 1)| ≤ ε. (12)
Considering Z = φ(X) − E^S φ(X) + E(φ(X)) we can observe right away that in this setting for all h ∈ H,
E(⟨h, Z⟩ | S = 0) = E(⟨h, Z⟩ | S = 1).
To see this, observe that E^S ⟨h, Z⟩ is almost surely equal to E(h(X)). In other words,
E(⟨h, Z⟩ | S = 0) × χ{S = 0} + E(⟨h, Z⟩ | S = 1) × χ{S = 1} = E^S ⟨h, Z⟩
is almost surely constant. Unless P(S = 0) = 0 or P(S = 1) = 0 this implies that E(⟨h, Z⟩ | S = 0) = E(⟨h, Z⟩ | S = 1). Hence, for the max-margin classifier h* and Z it holds that E(⟨h*, Z⟩ | S = 0) = E(⟨h*, Z⟩ | S = 1), and on the population level our new features Z guarantee that constraint (12) is automatically fulfilled.
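To make Equations (9)-(11) concrete, the sketch below builds the oblivious kernel matrix O for scalar features with an RBF kernel and an interval partition of S, then solves the oblivious ridge regression problem. All names, the partition and the data-generating choices are illustrative; this is a minimal sketch of the construction, not the authors' released implementation.

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def oblivious_gram(X_est, S_est, X_tr, S_tr, edges, sigma=1.0):
    """O[i, j] = <Z_i, Z_j> with Z_i = phi(X_tr[i]) - E_n^S phi(X)(S_tr[i]) + E_n(phi(X)),
    where the conditional expectation (Eq. 9) is estimated from the held-out half (X_est, S_est)."""
    m = len(X_est)
    cells_est = np.digitize(S_est, edges)
    cells_tr = np.digitize(S_tr, edges)
    # C[i, j]: coefficient of phi(X_est[j]) in E_n^S phi(X)(S_tr[i]) - E_n(phi(X)).
    C = np.zeros((len(X_tr), m))
    for i, c in enumerate(cells_tr):
        mask = cells_est == c
        if mask.any():
            C[i, mask] = 1.0 / mask.sum()
    C -= 1.0 / m
    K_tt = rbf_gram(X_tr, X_tr, sigma)
    K_te = rbf_gram(X_tr, X_est, sigma)
    K_ee = rbf_gram(X_est, X_est, sigma)
    # <Z_i, Z_j> expanded via the kernel trick: Z_i = phi(X_tr[i]) - sum_j C[i, j] phi(X_est[j]).
    return K_tt - K_te @ C.T - C @ K_te.T + C @ K_ee @ C.T

rng = np.random.default_rng(0)
S = rng.uniform(0, 1, 400)
X = S + 0.2 * rng.normal(size=400)
Y = X ** 2 + 0.1 * rng.normal(size=400)
edges = np.linspace(0, 1, 9)[1:-1]                        # partition of S = [0, 1] into 8 intervals

O = oblivious_gram(X[:200], S[:200], X[200:], S[200:], edges)
alpha = np.linalg.solve(O + 1e-1 * np.eye(200), Y[200:])  # Eq. (11): alpha = (O + lambda I)^{-1} y
print(O.shape, np.linalg.eigvalsh(O).min() > -1e-8)       # O is PSD up to numerical error
```

Prediction for a new point (X, S) then amounts to forming the coefficient row for its Z, exactly as inside the function above, and evaluating Σ_i α_i ⟨Z, Z_i⟩.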
Figure 3: Binary classification error vs. β̃-dependence between prediction and sensitive features is shown for three different methods: classical Linear SVM, Linear FERM, and Oblivious SVM. In Figure 3a the error is calculated with respect to the observed labels, which are intrinsically biased, and in Figure 3b the error is calculated with respect to the true fair classification rule.
7. Empirical evaluation
In this section we report our experimental results for classification and regression. Our objective in the classification experiment is to point out an important property of supervised learning problems where sensitive features affect both the non-sensitive features and the labels: the estimation error with respect to the observed labels can be misleading as a quality measure. The aim is much rather to predict values in an unbiased fashion. The first experiment highlights this difference by considering a synthetic data set for which we know the unbiased labels (though the true unbiased labels are not available to the methods). We measure the dependencies between the predicted values and the sensitive features, and compare against a standard SVM and FERM. The second set of experiments aims to investigate how dependencies between sensitive and non-sensitive features affect ORR and M-ORR. We investigate this relationship by considering a family of synthetic problems for which we can adjust the dependency between the features using a parameter γ. In this set of experiments we are also concerned with clarifying the relationship between ORR and M-ORR, where the latter is the M-Oblivious version of KRR, see Section 6.3. Our implementation can be found at the following repository: https://github.com/azalk/Oblivious.git.
We carried out an experiment to mimic a scenario where a class of students should normally receive grades on a fixed scale, and anyone with a grade above a fixed threshold θ = 2 should pass. Half of the class, representing a "minority group", are disadvantaged in that their grades are almost systematically reduced, while the other half receive a boost on average. More specifically, let the sensitive feature S be a {0, 1}-valued Bernoulli random variable with parameter 0.5, and let X_0 be distributed according to a truncated normal distribution with support bounded below by 1. Let the non-sensitive feature X, representing a student's grade, be given by
X := (X_0 − B) χ{S = 0} + (X_0 + B) χ{S = 1},
where B is a Bernoulli random variable independent of X_0 and of S. The label Y is defined as a noisy decision influenced by the student's "original grade" X_0 prior to the S-based modification. More formally, let U be a random variable independent of X_0 and of S, and uniformly distributed on [0, 1]. Let Y_0 := χ{U ≥ X_0} and define Y := Y_0 χ{X_0 + S ≥ θ}.
Classification Error. In a typical classification problem, the labels Y depend on both X and S, so when we remove the bias it is not clear what we should compare against when calculating the classification performance. Observe that our experimental construction here allows access to the true ground-truth labels
Y* := χ{X_0 ≥ θ}. (13)
Therefore, we are able to calculate the true (unbiased) errors as well. However, this is not always the case in practice. In fact, we argue that the question of how to evaluate fair classification performance is an important open problem which has yet to be addressed.
Measure of Dependence. Let F_n := σ(X_1, . . . , X_n, S_1, . . . , S_n), n ∈ N, be the σ-algebra generated by the training samples. In this experiment, we measure the dependence between the predicted labels Ŷ produced by any algorithm and the sensitive features S as
β̃(Ŷ, S) := (1/2) Σ_{s∈{0,1}} Σ_{y∈{0,1}} E |P(Ŷ = y, S = s | F_n) − P(Ŷ = y | F_n) P(S = s)|, (14)
which is closely related to the β-dependence (see, e.g. (Bradley, 2007, vol. I, p. 67)) between their respective σ-algebras. We obtain an empirical estimate of β̃(σ(Ŷ), σ(S)) by simply replacing the probabilities in (14) with corresponding empirical frequencies.
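The empirical estimate of (14) only needs the joint and marginal frequencies of the binary predictions and the binary sensitive feature; a minimal sketch (illustrative names, synthetic inputs):

```python
import numpy as np

def beta_tilde(y_pred, s):
    """Empirical version of Eq. (14) for binary y_pred and s: half the total variation
    distance between the empirical joint and the product of the empirical marginals."""
    y_pred = np.asarray(y_pred)
    s = np.asarray(s)
    total = 0.0
    for yv in (0, 1):
        for sv in (0, 1):
            p_joint = np.mean((y_pred == yv) & (s == sv))
            p_prod = np.mean(y_pred == yv) * np.mean(s == sv)
            total += abs(p_joint - p_prod)
    return 0.5 * total

rng = np.random.default_rng(0)
s = rng.integers(0, 2, 1000)
print(beta_tilde(s, s))                            # strongly dependent: close to 0.5
print(beta_tilde(rng.integers(0, 2, 1000), s))     # independent predictions: close to 0
```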
Experimental results. We generated n = 1000 training and test samples as described above, and the errors reported for each experiment are averaged over repetitions. Figure 3 shows binary classification error vs. dependence between prediction and sensitive features for three different methods: classical Linear SVM, Linear FERM, and Oblivious SVM. In Figure 3a the error is calculated with respect to the observed labels, which are intrinsically biased, and in Figure 3b the error is calculated with respect to the true fair classification rule Y* given by (13). As can be seen in the plots, the true classification error of Oblivious SVM is smaller than that of the other two methods. Moreover, in both plots the β-dependence between the predicted labels produced by Oblivious SVM and the sensitive feature is close to 0 and is much smaller than that of the other two methods.
In this section we compare ORR with KRR and the 'Markov' version of ORR, M-ORR, which applies the KRR solution to oblivious test features Z. We use an RBF kernel with σ = 1. We are particularly interested in how the dependence of X on S affects the performance and in a comparison of ORR to M-ORR. We use synthetic data to be able to control the dependence between S and X. The basic data generating process is as follows. Sensitive features S and non-sensitive features U are sampled independently from a uniform distribution with support [−1, 1]. The features X are a convex combination of these two of the form X = γU + (1 − γ)S, γ ∈ [0, 1]. We consider two ways to generate the response variable Y. In Experiment 1, the response variable is Y = X^2 + ε, where ε
Figure 4: Plots 4a and 4b correspond to Ridge Regression Experiments 1 and 2 respectively. In both plots, the performance of three estimators (KRR, ORR and M-ORR) is plotted against γ, where γ controls the dependence of X on S. The case of γ = 0 corresponds to the highest dependence while γ = 1 corresponds to the case in which X is independent of S.
is normally distributed and independent of U and S. In this case S → X → Y forms a Markov chain and we expect M-ORR to do well. In Experiment 2, the variable S influences Y also directly and not only through X, i.e. Y = X^2 + S^2 + ε. We use here S^2 instead of S because S^2 is not a zero-mean random variable and cannot simply be consumed into the noise term. Figure 4 shows the results of these experiments. In these experiments, γ varies over a grid of values between 0 and 1. For each value of γ we generate data points for ORR and M-ORR to infer the conditional expectations, and further data points are used by all three methods to calculate the ridge regression solution. For simplicity, we fixed a partition for the conditional expectation: the set S = [−1, 1] is split into a dyadic partition consisting of 16 sets. Each method uses a validation set of data points (which are different from the training data points) to select the regularization parameter λ from a fixed grid of candidate values. A test set is used to calculate the mean squared error (MSE). For each γ the experiment is repeated several times. Figure 4 reports the average MSE and the standard deviation of the MSE over these repetitions.
We make the following observations from Figure 4a. KRR is the best estimator as it uses the features X directly and not the new features Z. As γ → 1 both the ORR and M-ORR estimators approach the KRR estimator since the effect of S on X vanishes. Both estimators do not quite reach the performance of the KRR estimator. This is due to the additional uncertainty introduced by estimating the conditional expectations. By definition, the ORR estimator will achieve the best fit of the training data given the new features Z. We can observe that the M-ORR estimator is performing as well as the ORR estimator even though the M-ORR estimator uses the KRR solution and applies it to Z. This is due to the fact that S → X → Y forms a Markov chain. Finally, when γ = 0 both the M-ORR and ORR estimators achieve an MSE that is very close to the best MSE that can be achieved by a regressor that generates values which are independent of S: assume that some new features Z′ are given which are a function of X and are independent of S. When γ = 0 this random variable Z′ can only be independent of S if Z′ is a constant. However, if Z′ is a constant then the ridge regressor using Z′ is also a constant and the MSE E(Y − c)^2 = E(S^4) − 2cE(S^2) + c^2 + Var(ε) is minimized for c = E(S^2). The minimal value is very close to the values attained by the ORR and M-ORR estimators.
Figure 4b shares a few characteristics with Figure 4a as follows. For γ = 0 both M-ORR and ORR attain an MSE that is close to the best possible (in the above sense). As before, KRR is the overall best estimator and ORR is the best estimator using the features Z. Furthermore, as γ → 1 both estimators become close to the KRR solution. A crucial difference in this experiment is that S, X, Y does not form a Markov chain anymore and the performance of M-ORR is worse than that of ORR for intermediate values of γ.
The performance of M-ORRand ORR is essentially the same for γ = 0 and γ = 1 . This is not surprising given that when γ = 0 then Y = 2 S + (cid:15) and we are back in the Markov chain setting, while when γ = 1 then X is alreadyindependent of S .
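The construction used in these experiments can be illustrated with a small, self-contained sketch. The snippet below is not a reproduction of Experiments 1 and 2: the data-generating process, sample sizes, noise level and regularization used here are hypothetical placeholders. It only illustrates the pipeline, namely estimating the conditional expectation of the features given $S$ on a separate sample via a fixed partition of $\mathcal{S}$, forming $Z = X - E_n(X\,|\,S) + E_n(X)$ (the linear-kernel case, where $\varphi$ is the identity), and comparing ridge regression on $X$ with ridge regression on the oblivious features $Z$.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, gamma):
    # Hypothetical data-generating process (for illustration only):
    # S drives X unless gamma = 1, and Y depends on X plus noise.
    s = rng.uniform(-1.0, 1.0, size=n)
    u = rng.uniform(-1.0, 1.0, size=n)
    x = (1.0 - gamma) * s + gamma * u
    y = x + 0.1 * rng.standard_normal(n)
    return x[:, None], s, y

def oblivious_features(x, s, x_est, s_est, n_bins=16):
    # Plug-in estimate of E[X | S] using a fixed partition of [-1, 1],
    # estimated on a separate sample (x_est, s_est).
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    cond_mean = np.zeros((n_bins, x.shape[1]))
    grand_mean = x_est.mean(axis=0)
    idx_est = np.clip(np.digitize(s_est, edges) - 1, 0, n_bins - 1)
    for b in range(n_bins):
        mask = idx_est == b
        cond_mean[b] = x_est[mask].mean(axis=0) if mask.any() else grand_mean
    idx = np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
    return x - cond_mean[idx] + grand_mean  # Z = X - E_n[X|S] + E_n[X]

def ridge_fit_predict(x_tr, y_tr, x_te, lam=1e-2):
    d = x_tr.shape[1]
    w = np.linalg.solve(x_tr.T @ x_tr + lam * np.eye(d), x_tr.T @ y_tr)
    return x_te @ w

for gamma in (0.0, 0.5, 1.0):
    x_est, s_est, _ = make_data(500, gamma)      # to estimate E[X|S]
    x_tr, s_tr, y_tr = make_data(500, gamma)     # to fit the regressor
    x_te, s_te, y_te = make_data(2000, gamma)    # to evaluate
    z_tr = oblivious_features(x_tr, s_tr, x_est, s_est)
    z_te = oblivious_features(x_te, s_te, x_est, s_est)
    mse_x = np.mean((ridge_fit_predict(x_tr, y_tr, x_te) - y_te) ** 2)
    mse_z = np.mean((ridge_fit_predict(z_tr, y_tr, z_te) - y_te) ** 2)
    print(f"gamma={gamma:.1f}  MSE on X: {mse_x:.3f}  MSE on Z: {mse_z:.3f}")
```

For small $\gamma$ the oblivious features carry little information about $Y$ in this toy setup, so the MSE on $Z$ should exceed the MSE on $X$; as $\gamma$ approaches $1$ the two should coincide, mirroring the qualitative behaviour in Figure 4.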
8. Discussion
We have introduced a novel approach to derive oblivious features which approximate non-sensitive features well while maintaining only minimal dependence on sensitive features. We make use of Hilbert-space-valued conditional expectations and estimates thereof; our plug-in estimators may be of independent interest and open up interesting questions concerning their guarantees. The application of our approach to kernel methods is facilitated by an oblivious kernel matrix which we have derived to be used in place of the original kernel matrix. We characterize the dependencies between the oblivious and the sensitive features in terms of how 'close' the oblivious features are to the low-dimensional manifold $\varphi[\mathcal{X}]$. One may wonder if this relation can be exploited to further reduce dependencies, and potentially achieve complete independence. Another important question concerns the interplay between the estimation errors introduced by estimating conditional expectations and the estimation errors introduced by kernel methods which are applied to the oblivious data.

Appendix A. Probability in Hilbert spaces: elementary results
In this section we summarize a few elementary results, used throughout the paper, concerning random variables that attain values in a separable Hilbert space.
A.1 Measurable functions
There are three natural definitions of what it means for a function $f : \Omega \to H$ to be measurable. Denote the measure space in the following by $(\Omega, \mathcal{A})$, with the understanding that these definitions apply, in particular, to $\Omega = \mathbb{R}^d$ with $\mathcal{A}$ the corresponding Borel $\sigma$-algebra.

1. $f$ is Bochner-measurable iff $f$ is the point-wise limit of a sequence of simple functions, where $S : \Omega \to H$ is a simple function if it can be written as
\[
S(\omega) = \sum_{i=1}^{n} h_i \,\chi_{A_i}(\omega)
\]
for some $n \in \mathbb{N}$, $A_1, \ldots, A_n \in \mathcal{A}$ and $h_1, \ldots, h_n \in H$.

2. $f$ is strongly-measurable iff $f^{-1}[B] \in \mathcal{A}$ for every Borel-measurable subset $B$ of $H$. The topology used here is the norm-topology.

3. $f$ is weakly-measurable iff for every element $h \in H$ the function $\langle h, f\rangle : \Omega \to \mathbb{R}$ is measurable in the usual sense (using the Borel $\sigma$-algebra on $\mathbb{R}$).

All three definitions of measurability are equivalent in our setting. We call a function $f : \Omega \to H$ a random variable if it is measurable in this sense. The main example in our paper is $f = \varphi(X)$. This is a well-defined random variable whenever $X : \Omega \to \mathbb{R}^d$ and $\varphi : \mathbb{R}^d \to H$ are both Borel-measurable.

Appendix B. Hilbert space-valued conditional expectations
B.1 Basic properties
We recall a few important properties of Hilbert-space-valued conditional expectations. These often follow from properties of real-valued conditional expectations through 'scalarization' (Pisier, 2016). In the following, let $\varphi(X), Z \in L^2(\Omega, \mathcal{A}, P; H)$ and let $\mathcal{B}$ be some $\sigma$-subalgebra of $\mathcal{A}$. Due to Pisier (2016)[Eq. (1.7)], for any $f \in H$,
\[
\langle f, E^{\mathcal{B}}\varphi(X)\rangle = E^{\mathcal{B}}\langle f, \varphi(X)\rangle \quad \text{(a.s.)} \tag{15}
\]
and the right hand side is just the usual real-valued conditional expectation. It is also worth highlighting that the same holds for the Bochner-integral $E(\varphi(X))$, i.e. for any $f \in H$, $\langle f, E(\varphi(X))\rangle = E\langle f, \varphi(X)\rangle$. This can be used to derive properties of $E^{\mathcal{B}}\varphi(X)$. For instance, since $E(E^{\mathcal{B}}\langle f, \varphi(X)\rangle) = E\langle f, \varphi(X)\rangle$ is a property of real-valued conditional expectations we find right away that
\[
\langle f, E(\varphi(X))\rangle = E\langle f, \varphi(X)\rangle = E(E^{\mathcal{B}}\langle f, \varphi(X)\rangle) = E\langle f, E^{\mathcal{B}}\varphi(X)\rangle = \langle f, E(E^{\mathcal{B}}\varphi(X))\rangle.
\]
Because $E(\varphi(X))$ and $E(E^{\mathcal{B}}\varphi(X))$ are elements of $H$ and $\langle f, E(\varphi(X)) - E(E^{\mathcal{B}}\varphi(X))\rangle = 0$ for all $f \in H$, it follows that $E(\varphi(X)) = E(E^{\mathcal{B}}\varphi(X))$.

Another result we need is that if $Z$ is $\mathcal{B}$-measurable then $E^{\mathcal{B}}\langle \varphi(X), Z\rangle = \langle E^{\mathcal{B}}\varphi(X), Z\rangle$ (a.s.). Showing this needs a bit more work. Since $Z \in L^2(\Omega, \mathcal{B}, P; H)$ there exist $\mathcal{B}$-measurable simple functions $U_n$ such that $U_n$ converges point-wise to $Z$, $\lim_{n\to\infty}\|U_n^\bullet - Z^\bullet\| = 0$, and the sequence fulfills $\|U_n\| \le 2\|Z\|$ for all $n \in \mathbb{N}$ (Pisier, 2016)[Prop. 1.2]. Consider some $n$ and write
\[
U_n = \sum_{i=1}^{m} h_i \,\chi_{A_i}
\]
for suitable $m \in \mathbb{N}$, $h_i \in H$ and $A_i \in \mathcal{B}$; then
\[
E^{\mathcal{B}}\langle \varphi(X), U_n\rangle = \sum_{i=1}^{m} E^{\mathcal{B}}(\langle \varphi(X), h_i\rangle\,\chi_{A_i}) = \sum_{i=1}^{m} (E^{\mathcal{B}}\langle \varphi(X), h_i\rangle)\,\chi_{A_i} = \sum_{i=1}^{m} \langle E^{\mathcal{B}}\varphi(X), h_i\rangle\,\chi_{A_i} = \langle E^{\mathcal{B}}\varphi(X), U_n\rangle \quad \text{(a.s.)},
\]
because $\chi_{A_i}$ is $\mathcal{B}$-measurable. For the right hand side, point-wise convergence of $U_n$ to $Z$ tells us that for all $\omega \in \Omega$ we have $\lim_{n\to\infty}\|U_n(\omega) - Z(\omega)\| = 0$. Because $E^{\mathcal{B}}\varphi(X)^\bullet \in L^2(\Omega, \mathcal{A}, P; H)$ we also know that $E^{\mathcal{B}}\varphi(X)$ is finite almost surely. Therefore, for $\omega$ in the corresponding co-negligible set,
\[
\lim_{n\to\infty} |\langle (E^{\mathcal{B}}\varphi(X))(\omega), U_n(\omega)\rangle - \langle (E^{\mathcal{B}}\varphi(X))(\omega), Z(\omega)\rangle| \le \lim_{n\to\infty}\|(E^{\mathcal{B}}\varphi(X))(\omega)\|\,\|U_n(\omega) - Z(\omega)\| = 0
\]
and $\lim_{n\to\infty}\langle E^{\mathcal{B}}\varphi(X), U_n\rangle = \langle E^{\mathcal{B}}\varphi(X), Z\rangle$ almost surely. By the same argument it follows that $\lim_{n\to\infty}\langle \varphi(X), U_n\rangle = \langle \varphi(X), Z\rangle$ almost surely. Let $h_n = \langle \varphi(X), U_n\rangle$ and $h = \langle \varphi(X), Z\rangle$; then $|h_n - h| \le \|\varphi(X)\|\,\|U_n - Z\| \le 3\|\varphi(X)\|\,\|Z\|$. Furthermore, $|h_n| \le |h| + 3\|\varphi(X)\|\,\|Z\| \le 4\|\varphi(X)\|\,\|Z\| \le 2(\|\varphi(X)\|^2 + \|Z\|^2)$. The right hand side lies in $L^1$ and dominates $h_n$.
Using Shiryaev (1989)[II.§7.Thm.2(a)], we conclude that
\[
\lim_{n\to\infty} E^{\mathcal{B}}\langle \varphi(X), U_n\rangle = E^{\mathcal{B}}\langle \varphi(X), Z\rangle \quad \text{(a.s.)}
\]
and the result follows. The operator $E^{\mathcal{B}}$ is also idempotent and self-adjoint, i.e. $E^{\mathcal{B}}\varphi(X) = E^{\mathcal{B}}(E^{\mathcal{B}}\varphi(X))$ (a.s.) and $\langle \varphi(X)^\bullet, E^{\mathcal{B}} Z^\bullet\rangle = \langle E^{\mathcal{B}}\varphi(X)^\bullet, Z^\bullet\rangle$.
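For a finite-dimensional feature map and a discrete conditioning variable, the two identities above (the tower property and the pull-out property) can be checked exactly under the empirical measure, where the conditional expectation reduces to group means. The sketch below uses a hypothetical two-dimensional feature map; it is only a sanity check, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
s = rng.integers(0, 3, size=n)                    # discrete S with three values
x = rng.standard_normal(n) + s                    # X depends on S
phi = np.stack([x, x ** 2], axis=1)               # hypothetical feature map phi(X) in R^2

# Conditional expectation E[phi(X) | S] under the empirical measure: group means.
cond = np.zeros_like(phi)
for v in np.unique(s):
    cond[s == v] = phi[s == v].mean(axis=0)

# Tower property: E(E^S phi(X)) = E(phi(X)).
assert np.allclose(cond.mean(axis=0), phi.mean(axis=0))

# Pull-out property: for sigma(S)-measurable Z, <E^S phi(X), Z> = E^S <phi(X), Z>.
z = np.stack([np.cos(s), np.sin(s)], axis=1)      # Z is a function of S
lhs = np.einsum("ij,ij->i", cond, z)              # <E^S phi(X), Z> pointwise
inner = np.einsum("ij,ij->i", phi, z)             # <phi(X), Z> pointwise
rhs = np.zeros(n)
for v in np.unique(s):
    rhs[s == v] = inner[s == v].mean()            # E^S <phi(X), Z>
assert np.allclose(lhs, rhs)
print("tower and pull-out properties hold on the empirical measure")
```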
B.2 Representation of conditional expectations

A well known result in probability theory states that a conditional expectation $E^S X$ of a real-valued random variable $X$ given another real-valued random variable $S$ can be written as $g(S)$ for some suitable measurable function $g : \mathbb{R} \to \mathbb{R}$. This result generalizes to our setting. Here, we include the generalized result together with a short proof for reference.

Lemma 1
Consider a probability space $(\Omega, \mathcal{A}, P)$, and let $H$ be a separable Hilbert space. Let $S : \Omega \to \mathbb{R}^d$ be a random variable and suppose that $\eta : \Omega \to H$ is a $\sigma(S)$-measurable function. Then there exists a Bochner-measurable function $g : \mathbb{R}^d \to H$ such that $\eta = g \circ S$ almost surely.

Proof
We first show the statement for simple functions; observing that any arbitrary Bochner-measurable function can be written as the point-wise limit of a sequence of simple functions, we then extend the result to arbitrary $\eta$.

First, assume that $\eta := h\chi_A$ for some $h \in H$ and $A \in \sigma(S)$. Since $S$ is measurable with respect to $\mathcal{B}(\mathbb{R}^d)$ there exists some $B \in \mathcal{B}(\mathbb{R}^d)$ such that $\{\omega : S(\omega) \in B\} = A$. Define $g : \mathbb{R}^d \to H$ as $g := h\tilde{\chi}_B$, where $\tilde{\chi}$ denotes the indicator function on $\mathbb{R}^d$. We obtain $\eta(\omega) = h\chi_A(\omega) = h\tilde{\chi}_B(S(\omega))$ so that $\eta = g \circ S$. Next, let $\eta := \sum_{i=1}^{m} h_i \chi_{A_i}$ for some $m \in \mathbb{N}$, $h_1, \ldots, h_m \in H$ and $A_1, \ldots, A_m \in \sigma(S)$. As above, by measurability of $S$, there exist $B_1, \ldots, B_m \in \mathcal{B}(\mathbb{R}^d)$ such that $A_i = S^{-1}[B_i]$, $i \in 1, \ldots, m$. It follows that $\eta(\omega) = \sum_{i=1}^{m} h_i \chi_{A_i}(\omega) = \sum_{i=1}^{m} h_i \tilde{\chi}_{B_i}(S(\omega))$, $\omega \in \Omega$; hence, $\eta = g \circ S$ for $g = \sum_{i=1}^{m} h_i \tilde{\chi}_{B_i}$. Observe that in both cases $g$ is trivially Bochner-measurable by construction, since it is a simple function.

Let $\eta : \Omega \to H$ be an arbitrary Bochner-measurable function that is also measurable with respect to $\sigma(S)$. There exists a sequence of simple functions $\eta_n$, $n \in \mathbb{N}$, such that for every $\omega \in \Omega$ we have $\eta(\omega) = \lim_{n\to\infty} \eta_n(\omega)$.
Since each $\eta_n$ is a simple function, by our argument above there exists a sequence of Bochner-measurable functions $g_n : \mathbb{R}^d \to H$ such that $\eta_n = g_n \circ S$, where for each $n \in \mathbb{N}$ the function $g_n$ is simple of the form $g_n = \sum_{i=1}^{m_n} h_{i,n}\tilde{\chi}_{B_{i,n}}$ for some $m_n \in \mathbb{N}$, elements $h_{1,n}, \ldots, h_{m_n,n} \in H$ and Borel sets $B_{1,n}, \ldots, B_{m_n,n} \in \mathcal{B}(\mathbb{R}^d)$.

Denote by $B := \{S(\omega) : \omega \in \Omega\} \subset \mathbb{R}^d$ the image of $S$, and observe that for each $x \in B$ the limit $\lim_{n\to\infty} g_n(x)$ exists. To see this, note that by construction, for each $x \in B$ we have $x = S(\omega)$ for some $\omega \in \Omega$; thus, it holds that $\lim_{n\to\infty} g_n(x) = \lim_{n\to\infty} g_n(S(\omega)) = \lim_{n\to\infty}\eta_n(\omega) = \eta(\omega)$. Moreover, we have $P(S^{-1}[B]) = P(\{\omega \in \Omega : S(\omega) \in B\}) = P(\Omega) = 1$. Define $g : \mathbb{R}^d \to H$ as
\[
g(x) := \begin{cases} \lim_{n\to\infty} g_n(x) & x \in B \\ 0 & x \notin B. \end{cases} \tag{16}
\]
Thus, for each $\omega \in \Omega$, with probability $1$ we have
\[
\eta(\omega) = \lim_{n\to\infty}\eta_n(\omega) = \lim_{n\to\infty} g_n(S(\omega)) = g(S(\omega)), \tag{17}
\]
so that $\eta = g \circ S$ almost surely. On the other hand, since by definition $g$ is the pointwise limit of a sequence of simple functions, it is Bochner-measurable (see Property 1 in Section A.1) and the result follows.

Lemma 2
Consider a separable Hilbert space $H$, a probability space $(\Omega, \mathcal{A}, P)$, a Bochner-integrable random variable $\varphi(X) : \Omega \to H$ and a random variable $S : \Omega \to \mathbb{R}^d$. There exists a Bochner-measurable function $g : \mathbb{R}^d \to H$ such that $E^S\varphi(X) = g \circ S$ almost surely.

Proof
Observing that, by definition of the conditional expectation, $E^S\varphi(X)$ is a $\sigma(S)$-measurable function from $\Omega$ to $H$, the result readily follows from Lemma 1.

Appendix C. Proofs
C.1 Proof of Proposition 1

Proof
Let $\mathcal{M} := \varphi[\mathcal{X}]$ denote the manifold corresponding to the image of $\mathcal{X}$ under $\varphi$, equipped with the subspace topology and the corresponding Borel $\sigma$-algebra $\mathcal{B}(\mathcal{M})$. Define the metric projection map $\pi : H \rightrightarrows \mathcal{M}$ as the multi-valued function
\[
\pi(g) = \Bigl\{ h \in \mathcal{M} : \|h - g\| = \min_{h' \in \mathcal{M}}\|h' - g\| \Bigr\}. \tag{18}
\]
Note that the min operator in Equation (18) is well-defined since by definition $h = \varphi(x)$ for some $x \in \mathcal{X}$, the space $\mathcal{X}$ is compact and $\varphi$ is a continuous function. Observe that $\pi$ is not a function, but a multi-valued function which assigns to each element $g \in H$ a subset of $\mathcal{M}$; see, e.g. (Beer, 1993, Section 6.1) for more on this notion.

$\pi$ maps to non-empty compact subsets of $\mathcal{M}$. For each $g \in H$, set $f_g(h) = \|h - g\|$ with $h \in \mathcal{M}$, and note that it is a continuous function from $\mathcal{M}$ to $\mathbb{R}$. Let $m(g) := \min_{h\in\mathcal{M}}\|h - g\|$, which, by the above argument, is well-defined, and observe that, since $\{m(g)\}$ is a closed subset of $\mathbb{R}$, the set $\pi(g) = f_g^{-1}[\{m(g)\}]$ is a closed subset of $\mathcal{M}$. Since $\mathcal{M}$ is compact as the continuous image of the compact space $\mathcal{X}$, it follows that $\pi(g)$ is compact.

$\pi$ is upper-semicontinuous.$^2$ As follows from the standard definition, see, e.g. Beer (1993, Definition 6.2.4 and Theorem 6.2.5), the multi-valued function $\pi$ is said to be upper-semicontinuous at a point $g_0 \in H$ if for any open subset $V$ of $\mathcal{M}$ such that $\pi(g_0) \subseteq V$ it holds that $\pi(g) \subseteq V$ for each $g$ in some neighbourhood of $g_0$. To show the upper-semicontinuity of $\pi$ we proceed as follows. Take $g_0 \in H$. Let $V$ be an open subset of $\mathcal{M}$ such that $\pi(g_0) \subseteq V$. Denote $\widetilde{\mathcal{M}} := \mathcal{M}\setminus V$. Note that $\widetilde{\mathcal{M}}$ is compact since it is a closed subset of $\mathcal{M}$, which is in turn compact. Therefore, in much the same way as for $\mathcal{M}$, the min operator is well-defined for $\widetilde{\mathcal{M}}$, i.e. the minimum $\widetilde{m}(g) := \min_{h\in\widetilde{\mathcal{M}}}\|g - h\|$ exists. Moreover, since $\pi(g_0)\subseteq V$ and $V\cap\widetilde{\mathcal{M}} = \emptyset$, it holds that $\widetilde{m}(g_0) > m(g_0)$. Therefore, there exists some $\delta > 0$ such that $\widetilde{m}(g_0) \ge \delta + m(g_0)$. Consider an open ball $B_{g_0}(\delta/3)$ of radius $\delta/3$ around $g_0$. For every $g \in B_{g_0}(\delta/3)$ and all $h \in \pi(g_0)$ we have
\[
\|g - h\| \le \|g - g_0\| + \|g_0 - h\| \le \delta/3 + m(g_0). \tag{19}
\]
On the other hand, we have
\[
\min_{h'\in\widetilde{\mathcal{M}}}\|g - h'\| \ge \bigl|\,\|g - g_0\| - \|g_0 - h'\|\,\bigr| \ge m(g_0) + 2\delta/3. \tag{20}
\]
This implies that $\pi(g)\cap\widetilde{\mathcal{M}} = \emptyset$, because there are already better candidates (closer to $g$) in $\pi(g_0)$, which is in turn contained in $V$ and thus does not intersect $\widetilde{\mathcal{M}}$. Hence, it must hold that $\pi(g)\subseteq V$. Finally, since the choice of $g \in B_{g_0}(\delta/3)$ is arbitrary, it follows that for all $g \in B_{g_0}(\delta/3)$ we have $\pi(g)\subseteq V$ and $\pi$ is upper-semicontinuous.

$\varphi$ is a homeomorphism. To see this, note that $\varphi$ is bijective and continuous since the kernel is positive definite and continuous: it is by definition surjective, and it is injective since $\varphi(x) = \varphi(y)$ for $x \ne y$ would imply that $a_1^2\|\varphi(x)\|^2 - 2a_1a_2\langle\varphi(x), \varphi(y)\rangle + a_2^2\|\varphi(y)\|^2 = 0$ when $a_1 = a_2 = 1$. The statement follows now from Engelking (1989, Theorem 3.1.13) since $\mathcal{X}$ is compact and $\mathcal{M}$ is a Hausdorff space.

Measurable selection.
Since $\pi$ is upper-semicontinuous and maps to compact sets it is usco-compact (Fremlin, 2001, Definition 422A). This implies that $\pi$ is measurable as a function from $H$ to the compact subsets of $\mathcal{M}$, where the latter is equipped with the Vietoris topology and the corresponding Borel algebra (Fremlin, 2001, Proposition 5A4Db). Furthermore, there exists a Borel-measurable function $f$ from the compact, non-empty subsets of $\mathcal{M}$ to $\mathcal{M}$ such that $f(K) \in K$ for every compact, non-empty subset $K$ of $\mathcal{M}$. Define $W' = f(\pi(Z + h^*))$; then $W = \varphi^{-1}(W')$ is the continuous image of the measurable function $W'$ and $W$ has the stated properties.
2. Upper-semicontinuity is also referred to as upper-hemicontinuity for multi-valued functions in the literature.
C.2 Proof of Proposition 2

Proof (a)
Let $W$ be the random variable provided by Proposition 1 and let $\bar{W} = \varphi(W) - h^*$. Then $\|Z - \bar{W}\| = d(Z + h^*, \mathcal{M})$. Observe that two applications of the Cauchy-Schwarz inequality yield
\[
E(|\langle f, Z\rangle - f(W) + \langle f, h^*\rangle| \times \chi_B) \le E|\langle f, (Z - \varphi(W) + h^*)\times\chi_B\rangle| \le \|f\|\, E(\chi_B \times \|Z - \bar{W}\|) \le \sqrt{P(B)}\,\|f\|\,\|Z - \bar{W}\|
\]
for all $f \in H$. Similarly, for any $f \in H$ it holds that $E|\langle f, Z - \varphi(W) + h^*\rangle| \le \|f\|\, E\|Z - \bar{W}\| \le \|f\|\,\|Z - \bar{W}\|$. Noting that $Z$ is $H$-independent of $S$, we find that for any $f \in H$ and $B \in \sigma(S)$,
\[
|E(f(W)\times\chi_B) - Ef(W)P(B)| = |E((f(W) - \langle f, h^*\rangle)\times\chi_B) - E(f(W) - \langle f, h^*\rangle)P(B)| \le |E(\langle f, Z\rangle\times\chi_B) - E\langle f, Z\rangle P(B)| + (1 + \sqrt{P(B)})\,\|f\|\,\|Z - \bar{W}\| \le 2\|f\|\, d(Z + h^*, \mathcal{M}).
\]

(b) For $C \in \sigma(W)$ let $D$ be the image of $C$ under $W$, i.e. $D = W[C]$, $D \subset \mathcal{X}$. For $f \in H$ let $\xi_C(f) = \|\chi_D(W) - f(W)\|$. Now, for any $B \in \sigma(S)$,
\[
|P(C\cap B) - E(f(W)\times\chi_B)| \le P(B)^{1/2}\bigl(E(\chi_D(W) - f(W))^2\bigr)^{1/2} \le \xi_C(f).
\]
Moreover, we have $|P(C) - Ef(W)| \le \xi_C(f)$. Hence, for any $f \in H$ it holds that
\[
|P(C\cap B) - P(C)P(B)| \le 2\xi_C(f) + |E(f(W)\times\chi_B) - Ef(W)P(B)| \le 2\xi_C(f) + 2\|f\|\, d(Z + h^*, \mathcal{M}).
\]
This proves the first part of the proposition.

(c)
For the second part: by assumption, for $A \in \sigma(Z)$ there exists a $C \in \sigma(W)$ such that $P(A \,\triangle\, C) \le c$. For any such $C$ we have $|P(C) - P(A)| \le P(C\,\triangle\, A) \le c$ and $|P(C\cap B) - P(A\cap B)| \le P((C\,\triangle\, A)\cap B) \le c$. Hence,
\[
|P(A\cap B) - P(A)P(B)| \le 2c + 2\bigl(\xi_C(f) + \|f\|\, d(Z + h^*, \mathcal{M})\bigr)
\]
for all $f \in H$. Taking the infimum over $f$ and $C$ proves the second part of the proposition.
C.3 Proof of Proposition 3
Proof First note that since $\varphi$ is continuous and $\mathcal{X}$ is compact it follows that $\rho$ is finite and
\[
\|Z\| = \|\varphi(X) - E^S\varphi(X) + E\varphi(X)\| \le \|\varphi(X)\| + \|E^S\varphi(X)\| + \|E\varphi(X)\| \le \|\varphi(X)\| + E^S\|\varphi(X)\| + E\|\varphi(X)\| \le \rho, \tag{21}
\]
where the second inequality in (21) follows from Diestel and Uhl (1977, Theorem II.4) and Pisier (2016, Proposition 1.12). Let $Z_i$, $i \le n$, be $n$ independent copies of $Z$ and define $Y_i := \min_{h\in\mathcal{M}}\|Z_i + h^* - h\|$, $i \le n$, and $Y := \min_{h\in\mathcal{M}}\|Z + h^* - h\|$. By Hoeffding's inequality we have
\[
\Pr\Bigl(\Bigl|\frac{1}{n}\sum_{i=1}^{n} Y_i - EY\Bigr| \ge \epsilon\Bigr) \le 2\exp\Bigl(-\frac{n\epsilon^2}{2\rho^2}\Bigr)
\]
and the result follows.
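Proposition 3 suggests a simple way to report the Monte Carlo estimate of $E\min_{h\in\mathcal{M}}\|Z + h^* - h\|$ together with a confidence radius obtained by inverting the Hoeffding bound above. A minimal sketch, assuming the per-sample distances $Y_i$ have already been computed (for instance with the projection routine of Section E.3) and using the constant from the bound as stated here:

```python
import numpy as np

def distance_estimate(y, rho, delta=0.05):
    """Average distance and a Hoeffding-style radius at confidence 1 - delta.

    y   : array of per-sample distances Y_i = min_h ||Z_i + h* - h||
    rho : the bound rho on the relevant norms, as in Proposition 3
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Invert 2 * exp(-n * eps^2 / (2 * rho^2)) = delta for eps.
    eps = rho * np.sqrt(2.0 * np.log(2.0 / delta) / n)
    return y.mean(), eps

# Hypothetical usage with pre-computed distances:
y = np.abs(np.random.default_rng(2).normal(0.3, 0.05, size=200))
mean, radius = distance_estimate(y, rho=1.0)
print(f"estimated distance {mean:.3f} +/- {radius:.3f}")
```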
C.4 Proof of Proposition 4

Proof (a) We first show that
\[
\langle E^S\varphi(X)^\bullet, (Z')^\bullet\rangle = \langle E(\varphi(X)^\bullet), (Z')^\bullet\rangle. \tag{22}
\]
$E^S\varphi(X)^\bullet$ is an element of $L^2(\Omega, \sigma(S), P; H)$ and there exists a sequence of simple functions $\{U_n\}_{n\in\mathbb{N}}$ such that $\lim_{n\to\infty}\|U_n^\bullet - E^S\varphi(X)^\bullet\| = 0$. In particular, $\lim_{n\to\infty}\langle U_n^\bullet, (Z')^\bullet\rangle = \langle E^S\varphi(X)^\bullet, (Z')^\bullet\rangle$, and $\|E(U_n^\bullet) - E(\varphi(X)^\bullet)\| \le E\|U_n^\bullet - E^S\varphi(X)^\bullet\|$ goes to zero in $n$. Consider some $U_n = \sum_{i=1}^{m} h_i\chi_{A_i}$, $h_i \in H$, $A_i \in \sigma(S)$, $m \in \mathbb{N}$, and observe that
\[
\langle U_n^\bullet, (Z')^\bullet\rangle = \sum_{i=1}^{m} E\langle h_i\chi_{A_i}, Z'\rangle = \sum_{i=1}^{m} E(\langle h_i, Z'\rangle\,\chi_{A_i}) = \sum_{i=1}^{m} E\langle h_i, Z'\rangle\, E(\chi_{A_i}),
\]
using the assumption on $Z'$. The assumption can be applied because $\chi_{A_i}$ is $\sigma(S)$-measurable and, hence, can be written as a function of $S$ (Shiryaev, 1989)[II.§4.Thm.3]. Now,
\[
\sum_{i=1}^{m} E\langle h_i, Z'\rangle\, E(\chi_{A_i}) = E\Bigl\langle \sum_{i=1}^{m} h_i\, E(\chi_{A_i}), Z'\Bigr\rangle = E\langle E(U_n^\bullet), Z'\rangle
\]
and $\langle U_n^\bullet, (Z')^\bullet\rangle = \langle E(U_n^\bullet), (Z')^\bullet\rangle$. Equation (22) follows since $U_n^\bullet$ converges to $E^S\varphi(X)^\bullet$ and $E(U_n^\bullet)$ converges to $E(\varphi(X)^\bullet) = 0$ in $L^2(\Omega, \mathcal{A}, P; H)$.

(b) Since $\langle E^S\varphi(X)^\bullet, (Z')^\bullet\rangle = \langle E(\varphi(X)^\bullet), (Z')^\bullet\rangle$ and $\langle\varphi(X)^\bullet, E^S\varphi(X)^\bullet\rangle = \|E^S\varphi(X)^\bullet\|^2$, it follows right away that
\[
\|\varphi(X)^\bullet - (Z')^\bullet\|^2 = \|\varphi(X)^\bullet - Z^\bullet\|^2 + 2\bigl\langle E^S\varphi(X)^\bullet - E(\varphi(X)^\bullet),\ \varphi(X)^\bullet - E^S\varphi(X)^\bullet + E(\varphi(X)^\bullet) - (Z')^\bullet\bigr\rangle + \|Z^\bullet - (Z')^\bullet\|^2 = \|\varphi(X)^\bullet - Z^\bullet\|^2 + \|Z^\bullet - (Z')^\bullet\|^2.
\]
Hence, $Z$ is a minimizer and it is almost surely unique, because $\|Z^\bullet - (Z')^\bullet\|^2$ is only zero if $Z^\bullet = (Z')^\bullet$.

C.5 Proof of Proposition 5

Proof (a)
In the following, let $s_1, \ldots, s_l$ be the values $S$ can attain. Furthermore, let $f_i = E_n(\varphi(X)\,|\,S = s_i) - E(\varphi(X)\,|\,S = s_i)$, and let $\mathcal{F} = \sigma(X_1, S_1, \ldots, X_n, S_n)$. Each $f_i$ is $\mathcal{F}$-measurable. Observe that for $i \ne j$,
\[
E^{\mathcal{F}}(\langle f_i\chi\{S = s_i\}, f_j\chi\{S = s_j\}\rangle) = E^{\mathcal{F}}(\langle f_i, f_j\rangle\,\chi\{S = s_i, S = s_j\}) = \langle f_i, f_j\rangle\cdot E^{\mathcal{F}}(\chi\{S = s_i, S = s_j\}) = \langle f_i, f_j\rangle\cdot P(S = s_i, S = s_j) = 0,
\]
since $f_i, f_j$ are $\mathcal{F}$-measurable and $S$ is independent of $\mathcal{F}$. Hence,
\[
E^{\mathcal{F}}(\|E_n^S\varphi(X) - E^S\varphi(X)\|^2) = E^{\mathcal{F}}\Bigl(\Bigl\|\sum_{i=1}^{l} f_i\chi\{S = s_i\}\Bigr\|^2\Bigr) = \sum_{i=1}^{l} E^{\mathcal{F}}(\|f_i\chi\{S = s_i\}\|^2) = \sum_{i=1}^{l} E^{\mathcal{F}}(\|f_i\|^2\chi\{S = s_i\}) = \sum_{i=1}^{l}\|f_i\|^2 P(S = s_i) = \sum_{i=1}^{l} P(S = s_i)\sup_{\|h\|\le 1}|E_n(h(X)\,|\,S = s_i) - E(h(X)\,|\,S = s_i)|^2.
\]

(b) For each $i$, either $P(S = s_i) = 0$ or $\sup_{\|h\|\le 1}|E_n(h(X)\,|\,S = s_i) - E(h(X)\,|\,S = s_i)| \in O^*_P(n^{-1/2})$ using Grünewälder (2018). Since there are only $l$-many terms in the sum, this result carries over to the whole sum.

C.6 Proof of Proposition 6

Proof
Recall the notation $D_\ell := \{\Delta_i : i \in 1, \ldots, \ell^d\}$, $\ell \in \mathbb{N}$, where $\Delta_1, \Delta_2, \ldots, \Delta_{\ell^d}$ are the dyadic cubes of side-length $2/\ell$ discretizing $\mathcal{S}$. Let $\mathcal{G} := \sigma(\{S^{-1}[\Delta] : \Delta \in D_\ell\})$ and choose a Bochner-measurable $g : \mathcal{S} \to H$ according to Lemma 2 such that $g(S) = E^S\varphi(X)$ (a.s.). Since $\mathcal{G} \subseteq \sigma(S)$ we have, almost surely,
\[
E^{\mathcal{G}}\varphi(X) = E^{\mathcal{G}}(E^S\varphi(X)) = E^{\mathcal{G}}(g(S)). \tag{23}
\]
In the following, we write $g \circ S$ instead of $g(S)$ for readability. With probability one it holds that
\[
E^{\mathcal{F}}\|g\circ S - E^{\mathcal{G}}(g\circ S)\|^2 = E^{\mathcal{F}}\Bigl(\sum_{\Delta\in D_\ell}\|g\circ S - E^{\mathcal{G}}(g\circ S)\|^2\chi\{S\in\Delta\}\Bigr) = \sum_{\Delta\in D_\ell} E^{\mathcal{F}}\|(g\circ S - E^{\mathcal{G}}(g\circ S))\chi\{S\in\Delta\}\|^2 = \sum_{\Delta\in D_\ell} E^{\mathcal{F}}\Bigl\|\Bigl(g\circ S - \sum_{\Delta'\in D_\ell} E(g\circ S\,|\,S\in\Delta')\chi\{S\in\Delta'\}\Bigr)\chi\{S\in\Delta\}\Bigr\|^2 = \sum_{\Delta\in D_\ell} E^{\mathcal{F}}\bigl(\|g\circ S - E(g\circ S\,|\,S\in\Delta)\|^2\chi\{S\in\Delta\}\bigr).
\]
By Diestel and Uhl (1977, II.Corollary 8), for any $\Delta\in D_\ell$ the conditional expectation of $g$ given $\Delta$ is in the closed convex hull of $g[\Delta] := \{g(s) : s\in\Delta\}$, that is,
\[
\frac{1}{\mu(\Delta)}\int_\Delta g\, d\mu \in \mathrm{cch}(g[\Delta]).
\]
This means that for every $\epsilon > 0$ there exist $k\in\mathbb{N}$, points $s_1,\ldots,s_k\in\Delta$ and weights $\alpha_1,\ldots,\alpha_k > 0$ with $\sum_{j=1}^{k}\alpha_j = 1$ such that
\[
\Bigl\|\frac{1}{\mu(\Delta)}\int_\Delta g\, d\mu - \sum_{j=1}^{k}\alpha_j g(s_j)\Bigr\| \le \epsilon.
\]
Let $D := S^{-1}[\Delta]$. We obtain
\[
\frac{1}{\mu(\Delta)}\int_\Delta g\, d\mu = \frac{1}{P(D)}\int_D (g\circ S)\, dP = E(g\circ S\,|\,S\in\Delta).
\]
Since $g$ is assumed to be $L$-Lipschitz-continuous, for all $\Delta\in D_\ell$ we have
\[
\Bigl\|g\circ S - \sum_{j=1}^{k}\alpha_j g(s_j)\Bigr\|\chi\{S\in\Delta\} \le \sup_{s\in\Delta}\Bigl\|\sum_{j=1}^{k}\alpha_j(g(s) - g(s_j))\Bigr\| \le \sup_{s\in\Delta}\sum_{j=1}^{k}\alpha_j\|g(s) - g(s_j)\| \le \sum_{j=1}^{k}\alpha_j\sup_{s\in\Delta}\|g(s) - g(s_j)\| \le L\sum_{j=1}^{k}\alpha_j\sup_{s\in\Delta}\|s - s_j\| \le 2\sqrt{d}\,L\,\ell^{-1}.
\]
Moreover, noting that $\chi^2\{\cdot\} = \chi\{\cdot\}$, we obtain $\|g\circ S - \sum_{j=1}^{k}\alpha_j g(s_j)\|\chi\{S\in\Delta\} \le 2\sqrt{d}\,L\,\ell^{-1}$. It follows that
\[
\|(g\circ S - E(g\circ S\,|\,S\in\Delta))\chi\{S\in\Delta\}\| = \|g\circ S - E(g\circ S\,|\,S\in\Delta)\|\,\chi\{S\in\Delta\} \le \Bigl(\Bigl\|g\circ S - \sum_{j=1}^{k}\alpha_j g(s_j)\Bigr\| + \Bigl\|\sum_{j=1}^{k}\alpha_j g(s_j) - E(g\circ S\,|\,S\in\Delta)\Bigr\|\Bigr)\chi\{S\in\Delta\} \le \Bigl(\Bigl\|g\circ S - \sum_{j=1}^{k}\alpha_j g(s_j)\Bigr\| + \epsilon\Bigr)\chi\{S\in\Delta\}.
\]
Since this holds for every $\epsilon > 0$ we have $\|(g\circ S - E(g\circ S\,|\,S\in\Delta))\chi\{S\in\Delta\}\| \le 2\sqrt{d}\,L\,\ell^{-1}$. Observe that for $\Delta\ne\Delta'$, $\Delta, \Delta'\in D_\ell$,
\[
E^{\mathcal{F}}\bigl(\|g\circ S - E(g\circ S\,|\,S\in\Delta)\|\,\chi\{S\in\Delta\}\times\|g\circ S - E(g\circ S\,|\,S\in\Delta')\|\,\chi\{S\in\Delta'\}\bigr) = 0
\]
and
\[
E^{\mathcal{F}}\|g\circ S - E^{\mathcal{G}}(g\circ S)\|^2 = \sum_{\Delta\in D_\ell} E^{\mathcal{F}}\bigl(\|(g\circ S - E(g\circ S\,|\,S\in\Delta))\chi\{S\in\Delta\}\|^2\bigr) \le 4dL^2\ell^{-2}.
\]
In particular,
\[
E^{\mathcal{F}}\|g\circ S - E^{\mathcal{G}}\varphi(X)\|^2 \le 4dL^2\ell^{-2}. \tag{24}
\]
On the other hand, in much the same way as in the proof of Proposition 5, we have
\[
E^{\mathcal{F}}(\|E_n^S\varphi(X) - E^{\mathcal{G}}\varphi(X)\|^2) = \sum_{\Delta\in D_\ell} P(S\in\Delta)\sup_{\|h\|\le 1}|E_n(h(X)\,|\,S\in\Delta) - E(h(X)\,|\,S\in\Delta)|^2.
\]
Let $U := (X, S)$ and define the push-forward measure $\nu := P\circ U^{-1}$ of $P$ onto $\mathcal{X}\times\mathcal{S}$ under $U$. Set $\nu_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{(X_i, S_i)}$, where $\delta_{(X,S)}$ denotes the measure that has point mass at $(X, S)$. Define the projection map $\pi : \mathcal{X}\times\mathcal{S}\to\mathcal{X}$ which maps a tuple $(x, s)\in\mathcal{X}\times\mathcal{S}$ to its first element, so that $\pi((x, s)) = x$. For each $h\in H$ such that $\|h\|\le 1$ and every $\Delta\in D_\ell$ we obtain
\[
|E_n(h(X)\,|\,S\in\Delta) - E(h(X)\,|\,S\in\Delta)| = |E_n(h(\pi(U))\,|\,U\in\mathcal{X}\times\Delta) - E(h(\pi(U))\,|\,U\in\mathcal{X}\times\Delta)| = \Bigl|\int_{\mathcal{X}\times\Delta} h\circ\pi\, d\nu_n - \int_{\mathcal{X}\times\Delta} h\circ\pi\, d\nu\Bigr|.
\]
For each $\ell\in\mathbb{N}$ define $\mathcal{C}_\ell := \{\mathcal{X}\times\Delta : \Delta\in D_\ell\}$. By assumption, $H_C = \{h\times\chi_D : h\in H,\ \|h\|\le 1,\ D\in\bigcup_{\ell\in\mathbb{N}}\mathcal{C}_\ell\}$ is $P$-Donsker and for $D\in\mathcal{C}_\ell$, with $D = \mathcal{X}\times\Delta_i$ for $i\le\ell^d$, $\nu(D) = P(S^{-1}[\Delta_i]) \ge b\ell^{-d}$. For a given $\alpha\in(0, 1/2]$ let $\ell$ be $\lfloor n^{\alpha/d}\rfloor$ so that $\ell^{-d}\ge n^{-\alpha}$. Similarly to (Grünewälder, 2018, Proposition 3.2) it follows that there exists a constant $M$ such that for all $n\ge 1$ and corresponding $\ell$,
\[
\sup_{\|h\|\le 1}\sup_{C\in\mathcal{C}_\ell}\Bigl|\int_C h\circ\pi\, d\nu_n - \int_C h\circ\pi\, d\nu\Bigr| \le M\ell^{d} n^{-1/2}/b.
\]
Thus,
\[
E^{\mathcal{F}}(\|E_n^S\varphi(X) - E^{\mathcal{G}}\varphi(X)\|^2) \le M^2\ell^{2d}/(nb^2). \tag{25}
\]
Using Equations (24) and (25) as well as the Cauchy-Schwarz inequality for conditional expectations we obtain
\[
E^{\mathcal{F}}(\|E_n^S\varphi(X) - E^S\varphi(X)\|^2) \le E^{\mathcal{F}}\bigl((\|E_n^S\varphi(X) - E^{\mathcal{G}}\varphi(X)\| + \|E^{\mathcal{G}}\varphi(X) - E^S\varphi(X)\|)^2\bigr) \le M^2\ell^{2d}/(nb^2) + 4M\sqrt{d}\,L\,\ell^{d-1} n^{-1/2}/b + 4dL^2\ell^{-2}.
\]
Because $\ell = \lfloor n^{\alpha/d}\rfloor$ the upper-bound becomes
\[
M^2 n^{2\alpha-1}/b^2 + 4dL^2/\lfloor n^{\alpha/d}\rfloor^{2} + 4M\sqrt{d}\,L\, n^{\alpha(1-1/d)-1/2}/b.
\]
We claim that the rate of convergence in $n$ is optimized by $\alpha^* = d/(2(d+1))$: for $\alpha\ge\alpha^*$ we have $2\alpha - 1 \ge \alpha(1 - 1/d) - 1/2 \ge -2\alpha/d$ and the dominant term $n^{2\alpha-1}$ is minimized at $\alpha^*$. On the other hand, for $\alpha\le\alpha^*$, $2\alpha - 1 \le -2\alpha/d$ and $\alpha(1 - 1/d) - 1/2 \le -2\alpha/d$; in this case the dominant term is also minimized for $\alpha^*$. Therefore, we must set $\ell^* = \lfloor n^{\alpha^*/d}\rfloor = \lfloor n^{1/(2(d+1))}\rfloor$.
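For reference, the partition size suggested by this choice of $\ell^*$ can be computed directly from $n$ and $d$; the helper below is ours and merely evaluates $\lfloor n^{1/(2(d+1))}\rfloor$.

```python
import math

def partition_size(n: int, d: int) -> int:
    """Cubes per axis, l* = floor(n^(1/(2(d+1)))), as suggested by Proposition 6."""
    return max(1, math.floor(n ** (1.0 / (2 * (d + 1)))))

print([partition_size(n, d=1) for n in (100, 1000, 10000)])  # e.g. [3, 5, 10]
```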
Appendix D. Solution to the oblivious kernel ridge regression optimization problem

Define $z_i := (\langle Z_1, Z_i\rangle \;\cdots\; \langle Z_n, Z_i\rangle)^\top$, $i\in 1..n$, and observe that $O$ is the matrix whose $i$-th column is $z_i$:
\[
O = \begin{pmatrix} | & | & & | \\ z_1 & z_2 & \cdots & z_n \\ | & | & & | \end{pmatrix}.
\]
Let $\hat{f}$ be the minimizer of the regularized least-squares error as given by (11). By the representer theorem there exist scalars $\alpha_1, \ldots, \alpha_n$ such that $\hat{f} = \sum_{j=1}^{n}\alpha_j Z_j$. It follows that $\langle\hat{f}, Z_i\rangle = \sum_{j=1}^{n}\alpha_j\langle Z_j, Z_i\rangle$, so that
\[
\sum_{i=1}^{n}(\langle\hat{f}, Z_i\rangle - Y_i)^2 + \lambda\|\hat{f}\|^2 = (O\alpha - y)^\top(O\alpha - y) + \lambda\alpha^\top O\alpha, \tag{26}
\]
where $\alpha := (\alpha_1, \ldots, \alpha_n)^\top$ and $y := (Y_1, \ldots, Y_n)^\top$. Noting that $\hat{f}$ is the minimizer, and thus taking the gradient of (26) with respect to $\alpha$, we obtain
\[
\nabla_{\alpha}\bigl((O\alpha - y)^\top(O\alpha - y) + \lambda\alpha^\top O\alpha\bigr) = 0.
\]
Solving for $\alpha$ and noting that $O$ is symmetric, we obtain
\begin{align*}
\alpha &= O^{-1}(O^\top + \lambda I)^{-1} O^\top y \\
&= O^{-1}(O^\top + \lambda I)^{-1} O\, y && \text{since } O \text{ is symmetric} \\
&= O^{-1}(O^\top + \lambda I)^{-1}(O^{-1})^{-1} y \\
&= \bigl(O^{-1}(O^\top + \lambda I)\, O\bigr)^{-1} y \\
&= \bigl((O^{-1}O + \lambda O^{-1})\, O\bigr)^{-1} y && \text{since } O \text{ is symmetric} \\
&= (O + \lambda I)^{-1} y.
\end{align*}
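Given the oblivious kernel matrix $O$ and the label vector $y$, the closed-form solution above amounts to a single symmetric linear solve. A minimal sketch (the function name is ours):

```python
import numpy as np

def orr_coefficients(O: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve (O + lam * I) alpha = y for oblivious kernel ridge regression."""
    n = O.shape[0]
    return np.linalg.solve(O + lam * np.eye(n), y)
```

Predictions for a new observation are then $\sum_i \alpha_i\langle Z, Z_i\rangle$, with the inner products computed as described in Section E.2.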
Appendix E. Algorithms

We discuss three algorithms in this section: an algorithm to calculate the oblivious kernel matrix (Section E.1), an algorithm to calculate $\langle Z, Z_i\rangle$, which is needed for prediction (Section E.2), and an algorithm to calculate $W$, the projection of $Z_i$ onto $\mathcal{M}$, which also allows us to estimate the distance between $Z$ and $\mathcal{M}$ (Section E.3).

E.1 Calculating the oblivious kernel matrix
We start by deriving the algorithm for calculating the oblivious matrix. The resulting algorithm is summarized in Algorithm 1 on page 31. Throughout we assume that $A_1, \ldots, A_l$ is a partition of $\mathcal{S}$ and that $2n$ samples $(X_i, S_i)$ are available. The algorithm splits the data into two parts of size $n$ and uses the samples $n+1, \ldots, 2n$ to estimate the conditional expectation. The remaining $n$ samples are then used to generate the features $Z_i$, $i = 1, \ldots, n$. The features $Z_i$ will not be explicitly stored; the only object that will be stored is the oblivious matrix $O$. To calculate the oblivious matrix we only need kernel evaluations. To see this, consider any $i \le n$; then
\[
Z_i = \varphi(X_i) - E_n^{S_i}\varphi(X) + E_n(\varphi(X)), \quad\text{where}\quad E_n^{S_i}\varphi(X) = \sum_{u=1}^{l} E_n(\varphi(X)\,|\,S\in A_u)\times\chi\{S_i\in A_u\}.
\]
Algorithm 1
Generating the oblivious kernel matrix; the sum over an empty index set is treated as 0.

Input: data $(x_1, s_1), \ldots, (x_{2n}, s_{2n})$, disjoint sets $A_1, \ldots, A_l$ which cover $\mathcal{S}$
set $M = \sum_{i=n+1}^{2n}\sum_{j=n+1}^{2n} k(x_i, x_j)/n^2$
set $I_i = \emptyset$, $i \in 1, \ldots, l$
for $i = n+1$ to $2n$ do
  find index $u$ such that $s_i \in A_u$
  update $I_u \leftarrow I_u \cup \{i\}$
end for
for $i = 1$ to $n$ do
  set $\rho_i = \sum_{u=n+1}^{2n} k(x_i, x_u)/n$
  for $a = 1$ to $l$ do
    set $\xi_{i,a} = \sum_{u\in I_a} k(x_i, x_u)/|I_a|$
  end for
end for
for $a = 1$ to $l$ do
  set $\tau_a = \sum_{u\in I_a}\sum_{v=n+1}^{2n} k(x_u, x_v)/(n|I_a|)$
  for $b = 1$ to $l$ do
    set $o_{a,b} = \sum_{u\in I_a, v\in I_b} k(x_u, x_v)/(|I_a||I_b|)$
  end for
end for
for $i = 1$ to $n$ do
  for $j = i$ to $n$ do
    set $a$ such that $s_j \in A_a$
    set $b$ such that $s_i \in A_b$
    set $O_{i,j} = k(x_i, x_j) - \xi_{i,a} - \xi_{j,b} + o_{a,b} + M + \rho_i + \rho_j - \tau_a - \tau_b$
    set $O_{j,i} = O_{i,j}$
  end for
end for
Return: $O$

For $u = 1, \ldots, l$ let
\[
N_u = \sum_{v=n+1}^{2n}\chi\{S_v\in A_u\}
\]
be the number of samples with indices within $n+1, \ldots, 2n$ that fall into set $A_u$. The estimate of the elementary conditional expectation is
\[
E_n(\varphi(X)\,|\,S\in A_u) = \frac{1}{N_u}\sum_{v=n+1}^{2n}\varphi(X_v)\times\chi\{S_v\in A_u\},
\]
which attains values in $H$.
Now consider the inner product between $Z_i$ and $Z_j$, $i, j \le n$:
\[
\langle Z_i, Z_j\rangle = \langle\varphi(X_i), \varphi(X_j)\rangle - \langle\varphi(X_i), E_n^{S_j}\varphi(X)\rangle - \langle E_n^{S_i}\varphi(X), \varphi(X_j)\rangle + \langle E_n^{S_i}\varphi(X), E_n^{S_j}\varphi(X)\rangle + \langle\varphi(X_i), E_n(\varphi(X))\rangle + \langle E_n(\varphi(X)), \varphi(X_j)\rangle - \langle E_n^{S_i}\varphi(X), E_n(\varphi(X))\rangle - \langle E_n(\varphi(X)), E_n^{S_j}\varphi(X)\rangle + \langle E_n(\varphi(X)), E_n(\varphi(X))\rangle.
\]
This reduces to calculations involving only the kernel function and no other functions from $H$. In detail, $\langle\varphi(X_i), \varphi(X_j)\rangle = k(X_i, X_j)$, and
\[
\langle\varphi(X_i), E_n^{S_j}\varphi(X)\rangle = \sum_{u=1}^{l}\langle\varphi(X_i), E_n(\varphi(X)\,|\,S\in A_u)\rangle\times\chi\{S_j\in A_u\},
\]
where
\[
\langle\varphi(X_i), E_n(\varphi(X)\,|\,S\in A_u)\rangle = \frac{1}{N_u}\sum_{v=n+1}^{2n}\langle\varphi(X_i), \varphi(X_v)\rangle\times\chi\{S_v\in A_u\} = \frac{1}{N_u}\sum_{v=n+1}^{2n} k(X_i, X_v)\times\chi\{S_v\in A_u\}.
\]
The inner product $\langle E_n(\varphi(X)\,|\,S\in A_u), \varphi(X_j)\rangle$ can be calculated in the same way. Furthermore,
\[
\langle E_n^{S_i}\varphi(X), E_n^{S_j}\varphi(X)\rangle = \sum_{u=1}^{l}\sum_{v=1}^{l}\langle E_n(\varphi(X)\,|\,S\in A_u), E_n(\varphi(X)\,|\,S\in A_v)\rangle\times\chi\{S_i\in A_u, S_j\in A_v\}
\]
and
\[
\langle E_n(\varphi(X)\,|\,S\in A_u), E_n(\varphi(X)\,|\,S\in A_v)\rangle = \frac{1}{N_u N_v}\sum_{l'=n+1}^{2n}\sum_{m=n+1}^{2n}\langle\varphi(X_{l'}), \varphi(X_m)\rangle\times\chi\{S_{l'}\in A_u, S_m\in A_v\} = \frac{1}{N_u N_v}\sum_{l'=n+1}^{2n}\sum_{m=n+1}^{2n} k(X_{l'}, X_m)\times\chi\{S_{l'}\in A_u, S_m\in A_v\}.
\]
The terms involving $E_n(\varphi(X)) = (1/n)\sum_{i=n+1}^{2n}\varphi(X_i)$ reduce in a similar way to kernel evaluations. Combining these calculations leads to Algorithm 1.
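The reduction above can be written compactly in matrix form: the elementary conditional expectations become weighted averages of kernel columns over the index sets $I_a$. The sketch below is a vectorized rendering of Algorithm 1; the function and argument names are ours, `kernel(a, b)` is assumed to return the kernel matrix between two sets of rows, `cells` is assumed to map sensitive values to integer cell indices of the partition, and every cell is assumed to contain at least one estimation sample.

```python
import numpy as np

def oblivious_kernel_matrix(kernel, x_train, x_est, s_train, s_est, cells):
    """Oblivious kernel matrix O with O[i, j] = <Z_i, Z_j>.

    kernel(a, b) returns the kernel matrix between row-sets a and b;
    cells(s) maps an array of sensitive values to integer cell indices
    of the partition A_1, ..., A_l.
    """
    n_est = x_est.shape[0]
    c_train = cells(s_train)                      # cell index of each training point
    c_est = cells(s_est)                          # cell index of each estimation point
    l = int(c_est.max()) + 1

    K_te = kernel(x_train, x_est)                 # k(x_i, x_v), i train, v estimation
    K_ee = kernel(x_est, x_est)

    # Per-cell averaging matrix P[v, a] = 1/|I_a| if s_v in A_a else 0.
    P = np.zeros((n_est, l))
    P[np.arange(n_est), c_est] = 1.0
    P /= P.sum(axis=0, keepdims=True)

    xi = K_te @ P                                  # xi[i, a] = <phi(x_i), c_a>
    rho = K_te.mean(axis=1)                        # rho[i]   = <phi(x_i), m>
    o = P.T @ K_ee @ P                             # o[a, b]  = <c_a, c_b>
    tau = (P.T @ K_ee).mean(axis=1)                # tau[a]   = <c_a, m>
    M = K_ee.mean()                                # M        = <m, m>

    K_tt = kernel(x_train, x_train)
    A = xi[:, c_train]                             # A[i, j] = <phi(x_i), c_{cell(j)}>
    O = (K_tt - A - A.T + o[np.ix_(c_train, c_train)]
         + rho[:, None] + rho[None, :]
         - tau[c_train][:, None] - tau[c_train][None, :] + M)
    return O
```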
E.2 Prediction based on oblivious features

To be able to predict labels for new observations $(X, S)$ in a regression or classification setting we need to transform $(X, S)$ into an oblivious feature $Z$. The approach to do so is the same as for the training data. In particular, the conditional expectation estimates $E_n^{S}\varphi(X)$ are needed to transform $(X, S)$ into $Z$. For kernel methods $Z$ itself is never calculated explicitly, but it appears in algorithms in the form of inner products $\langle Z, Z_i\rangle$, where $i \le n$ and the $Z_i$ are the oblivious features corresponding to the training set. These inner products can be calculated in exactly the same way as the inner products $\langle Z_i, Z_j\rangle$ in Section E.1.
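The inner products $\langle Z, Z_i\rangle$ for a new observation reuse the same intermediate quantities. A sketch, assuming `P`, `K_ee`, `tau`, `M` and `c_train` have been computed as in the previous snippet (the helper name is ours):

```python
import numpy as np

def oblivious_inner_products(kernel, x_new, s_new, x_train, x_est,
                             c_train, cells, P, K_ee, tau, M):
    """Vector (<Z, Z_1>, ..., <Z, Z_n>) for a single new observation (x_new, s_new).

    P, K_ee, tau, M and c_train are the intermediate quantities computed in
    oblivious_kernel_matrix; x_new is a single row, s_new a single sensitive value.
    """
    a = int(cells(np.asarray([s_new]))[0])          # cell of the new observation
    k_nt = kernel(x_new[None, :], x_train)[0]       # k(x, x_i)
    k_ne = kernel(x_new[None, :], x_est)[0]         # k(x, x_v) over the estimation sample
    K_te = kernel(x_train, x_est)

    xi_new = k_ne @ P                               # <phi(x), c_b> for every cell b
    rho_new = k_ne.mean()                           # <phi(x), m>
    xi_tr = K_te @ P                                # <phi(x_i), c_b>
    rho_tr = K_te.mean(axis=1)                      # <phi(x_i), m>
    o = P.T @ K_ee @ P                              # <c_a, c_b>

    return (k_nt - xi_new[c_train] - xi_tr[:, a] + o[a, c_train]
            + rho_new + rho_tr - tau[a] - tau[c_train] + M)
```

A prediction for $(x, s)$ is then the dot product of these inner products with the coefficient vector $\alpha$ from Appendix D.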
E.3 Projecting the oblivious features onto the manifold
The quadratic distance between $Z$, or more precisely $Z(\omega)$, and $\mathcal{M}$ in $H$ is equal to
\[
\inf_{x\in\mathcal{X}}\|Z - \varphi(x)\|^2 = \|Z\|^2 + \inf_{x\in\mathcal{X}}\bigl(k(x, x) - 2\langle Z, \varphi(x)\rangle\bigr).
\]
The constant $\|Z\|^2$ is of no relevance and we are looking for a minimum (when this is well-defined) of the function $f(x) = k(x, x) - 2\langle Z, \varphi(x)\rangle$ on $\mathcal{X}$. Using the conditional expectation $E_n^S\varphi(X)$ and $Z = \varphi(X) - E_n^S\varphi(X) + E(\varphi(X))$ we can rewrite $f(x)$ as
\[
f(x) = k(x, x) - 2\bigl(k(X, x) - E_n^S k(X, x) + E(k(X, x))\bigr).
\]
The function $f$ is $(\alpha, L)$-Hölder continuous whenever $k(x, \cdot)$ is $(\alpha, L')$-Hölder-continuous for all $x\in\mathcal{X}$, with $L = 8L'$, since then
\[
|f(x) - f(y)| \le |k(x, x) - k(y, y)| + 2\bigl(|k(X, x) - k(X, y)| + E_n^S|k(X, y) - k(X, x)| + E|k(X, x) - k(X, y)|\bigr) \le |k(x, x) - k(x, y)| + |k(x, y) - k(y, y)| + 6L'\|y - x\|^\alpha \le 8L'\|y - x\|^\alpha.
\]
This property of $f$ is useful because various kernel functions are Hölder-continuous and efficient algorithms are available to optimize Hölder-continuous functions. In particular, there exist classical global optimization algorithms (Vanderbei, 1997) and bandit algorithms (Munos, 2014) for this task. The projection of $Z$ onto $\mathcal{M}$ can also be used directly to approximate $d_n(Z + h^*, \mathcal{M})$ and, by applying Proposition 3, to estimate $d(Z + h^*, \mathcal{M})$.
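A crude but serviceable way to carry out this minimization is to evaluate $f$ on a candidate set covering $\mathcal{X}$ and take the argmin; Hölder continuity of $f$ justifies replacing this with Piyavskii-type global optimization or bandit algorithms when more accuracy is needed. The sketch below is ours; `z_inner` is an assumed caller-supplied function returning $\langle Z, \varphi(x)\rangle$ for a batch of candidates, computed from kernel evaluations as in the display above.

```python
import numpy as np

def project_onto_manifold(kernel, z_inner, x_candidates):
    """Approximate projection of an oblivious feature Z onto M = phi[X].

    z_inner(x_batch) must return <Z, phi(x)> for each candidate row; for
    Z_i = phi(x_i) - E_n(phi(X) | S in A_{a(i)}) + E_n(phi(X)) this is
    k(x_i, x) - mean_{v in I_{a(i)}} k(x_v, x) + mean_v k(x_v, x).
    The search evaluates f(x) = k(x, x) - 2 <Z, phi(x)> over the candidates;
    ||Z||^2 is an additive constant and can be ignored for the argmin.
    """
    diag = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in x_candidates])
    f = diag - 2.0 * z_inner(x_candidates)
    j = int(np.argmin(f))
    return x_candidates[j], f[j]
```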
References

G. Beer. Topologies on Closed and Closed Convex Sets, volume 268. Springer Science & Business Media, 1993.

D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.

R. V. Bradley. Introduction to Strong Mixing Conditions, Vols. 1, 2 and 3. Kendrick Press, 2007.

T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. 2009.

F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney. Optimized preprocessing for discrimination prevention. In Advances in Neural Information Processing Systems, 2017.

J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, 1977.

M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, 2018.

P. Doukhan. Mixing: Properties and Examples. Springer Lecture Notes, 1994.

R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 2nd edition, 2014.

R. Engelking. General Topology. Heldermann Verlag Berlin, 1989.

D. H. Fremlin. Measure Theory. Torres Fremlin, 2001.

E. Giné and R. Nickl. Mathematical Foundations of Infinite-dimensional Statistical Models. Cambridge University Press, 2016.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 2008.

S. Grünewälder. Plug-in estimators for conditional expectations and probabilities. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, 2016.

N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, 2017.

J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In Christos H. Papadimitriou, editor, volume 67. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017.

M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.

C. Louizos, K. Swersky, Y. Li, M. Welling, and R. S. Zemel. The variational fair autoencoder. In International Conference on Learning Representations, 2015.

D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, 2018.

D. J. Marcus. Relationships between Donsker classes and Sobolev spaces. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1985.

R. Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 2014.

G. Pisier. Martingales in Banach Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016.

A. Shiryaev. Probability. Springer Graduate Texts in Mathematics, second edition, 1989.

B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 2011.

R. J. Vanderbei. Extension of Piyavskii's algorithm to continuous global optimization. Technical report, Princeton University, 1997.

M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, 2017.

R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.