On the Global Optima of Kernelized Adversarial Representation Learning
Bashir Sadeghi, Michigan State University, [email protected]
Runyi Yu, Eastern Mediterranean University, [email protected]
Vishnu Naresh Boddeti, Michigan State University, [email protected]
Abstract
Adversarial representation learning is a promising paradigm for obtaining data representations that are invariant to certain sensitive attributes while retaining the information necessary for predicting target attributes. Existing approaches solve this problem through iterative adversarial minimax optimization and lack theoretical guarantees. In this paper, we first study the "linear" form of this problem, i.e., the setting where all the players are linear functions. We show that the resulting optimization problem is both non-convex and non-differentiable. We obtain an exact closed-form expression for its global optima through spectral learning and provide performance guarantees in terms of analytical bounds on the achievable utility and invariance. We then extend this solution and analysis to non-linear functions through kernel representation. Numerical experiments on UCI, Extended Yale B and CIFAR-100 datasets indicate that, (a) practically, our solution is ideal for "imparting" provable invariance to any biased pre-trained data representation, and (b) empirically, the trade-off between utility and invariance provided by our solution is comparable to iterative minimax optimization of existing deep neural network based approaches. Code is available at https://github.com/human-analysis/Kernel-ARL
1. Introduction
Adversarial representation learning (ARL) is a promising framework for training image representation models that can control the information encapsulated within them. ARL is employed in practice to learn representations for a variety of applications, including unsupervised domain adaptation of images [8], censoring sensitive information from images [7], learning fair and unbiased representations [20, 21], learning representations that are controllably invariant to sensitive attributes [29], and mitigating unintended information leakage [26], amongst others.

At the core of the ARL formulation is the idea of jointly optimizing three entities: (i) an encoder E that seeks to distill the information from the input data, retaining the information relevant to a target task while intentionally and permanently eliminating the information corresponding to a sensitive attribute, (ii) a predictor T that seeks to extract a desired target attribute, and (iii) a proxy adversary A, playing the role of an unknown adversary, that seeks to extract a known sensitive attribute. Figure 1 shows a pictorial illustration of the ARL problem.

Figure 1: Adversarial Representation Learning consists of three entities: an encoder E that obtains a compact representation z of input data x, a predictor T that predicts a desired target attribute y, and an adversary A that seeks to extract a sensitive attribute s, both from the embedding z.

Typical instantiations of ARL represent these entities through non-linear functions in the form of deep neural networks (DNNs) and formulate parameter learning as a minimax optimization problem. Practically, optimization is performed through simultaneous gradient descent, wherein small gradient steps are taken concurrently in the parameter space of the encoder, predictor and proxy adversary. The solutions thus obtained have been effective in learning data representations with controlled invariance across applications such as image classification [26], multi-lingual machine translation [29] and domain adaptation [8].

Despite its practical promise, the aforementioned ARL setup suffers from a number of drawbacks:

– The minimax formulation of ARL leads to an optimization problem that is non-convex in the parameter space, both due to the adversarial loss function and due to the non-linear nature of modern DNNs. As we show in this paper, even for simple instances of ARL where each entity is characterized by a linear function, the problem remains non-convex in the parameter space. Similar observations [25] have been made in the different but related context of adversarial learning in generative adversarial networks (GANs) [12].

– The current paradigm of simultaneous gradient descent for solving the ARL problem provides no provable guarantees while suffering from instability and poor convergence [26, 21]. Again, similar observations on such limitations have been made [22, 25] in the context of GANs.

– In applications of ARL related to fairness, accountability and transparency of machine learning models, it is critically important to provide performance bounds in addition to empirical evidence of model efficacy. A major shortcoming of existing DNN based ARL solutions is the lack of theoretical analysis or provable bounds on achievable utility and fairness.

Figure 2: Overview: Illustration of adversarial representation learning for imparting invariance to a fixed biased pre-trained image representation x = F(x′; Θ_F). An encoder E, in the form of a kernel mapping, produces a new representation z. A target predictor and an adversary, in the form of linear regressors, operate on this new representation. We theoretically analyze this ARL setup to obtain a closed-form solution for the globally optimal parameters of the encoder Θ_E. Provable bounds on the achievable trade-off between the utility and fairness of the representation are also derived.

In this paper, we take a step back and analytically study the simplest version of the ARL problem from an optimization perspective, with the goal of addressing the aforementioned drawbacks. Doing so enables us to delineate the contributions of the expressivity of the entities in ARL (i.e., shallow vs. deep models) and the challenges of optimizing the parameters (i.e., local optima through simultaneous gradient descent vs. global optima).
Contributions:
We first consider the "linear" form of ARL, where the encoder is a linear transformation and both the target predictor and the proxy adversary are linear regressors. We show that this Linear-ARL leads to an optimization problem that is both non-convex and non-differentiable. Despite this fact, by reducing it to a set of trace problems on a Stiefel manifold, we obtain an exact closed-form solution for the global optima. As part of our solution, we also determine the optimal dimensionality of the embedding space. We then obtain analytical bounds (lower and upper) on the target and adversary objectives and prescribe a procedure to explicitly control the maximal leakage of sensitive information. Finally, we extend the Linear-ARL formulation to allow non-linear functions through a kernel extension, while still enjoying an exact closed-form solution for the global optima. Numerical experiments on multiple datasets, both small and large scale, indicate that the global optima solutions for the linear and kernel formulations of ARL are competitive with, and sometimes even outperform, DNN based ARL trained through simultaneous stochastic gradient descent. Practically, we also demonstrate the utility of Linear-ARL and Kernel-ARL for "imparting" provable invariance to any biased pre-trained data representation. Figure 2 provides an overview of our contributions. We refer to our proposed algorithm for obtaining the global optima as Spectral-ARL, abbreviated as SARL.
Notation:
Scalars are denoted by regular lower case or Greek letters, e.g., n, λ. Vectors are denoted by boldface lowercase letters, e.g., x, y. Matrices are uppercase boldface letters, e.g., X. A k × k identity matrix is denoted by I_k or I. A centered (column-wise mean subtracted) data matrix is indicated by a tilde, e.g., X̃. If X contains n columns, then X̃ = XD, where D = I_n − (1/n) 1 1^T and 1 denotes the vector of ones of length n. Given a matrix M ∈ R^{m×m}, we use Tr[M] to denote its trace (i.e., the sum of its diagonal elements); its Frobenius norm is denoted by ‖M‖_F, which is related to the trace as ‖M‖_F^2 = Tr[M M^T] = Tr[M^T M]. The subspace spanned by the columns of M is denoted by R(M) or simply M (in calligraphy); the orthogonal complement of M is denoted by M^⊥. The null space of M is denoted by N(M). The orthogonal projection onto M is P_M = M (M^T M)^† M^T, where the superscript "†" indicates the Moore-Penrose pseudo-inverse [19].

Let x ∈ R^d be a random vector. We denote its expectation by E[x] and its covariance matrix by C_x ∈ R^{d×d}, where C_x = E[(x − E[x])(x − E[x])^T]. Similarly, the cross-covariance C_xy ∈ R^{d×r} between x ∈ R^d and y ∈ R^r is C_xy = E[(x − E[x])(y − E[y])^T]. For a d × d positive definite matrix C ≻ 0, its Cholesky factorization results in a full rank matrix Q ∈ R^{d×d} such that

C = Q^T Q   (1)
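For concreteness, the centering matrix and the Cholesky convention above can be written in a few lines of NumPy. This is only an illustrative sketch of ours (not from the released code); note that numpy.linalg.cholesky returns a lower-triangular factor L with C = L L^T, so Q = L^T matches the convention C = Q^T Q used in (1).

```python
import numpy as np

n, d = 100, 5
X = np.random.randn(d, n)               # data matrix, one sample per column

# Centering matrix D = I_n - (1/n) 1 1^T, so that X_tilde = X D is column-centered
D = np.eye(n) - np.ones((n, n)) / n
X_tilde = X @ D                          # same as X - X.mean(axis=1, keepdims=True)

# Cholesky factor Q with C_x = Q^T Q (NumPy returns lower-triangular L with C = L L^T)
C_x = X_tilde @ X_tilde.T / n
Q_x = np.linalg.cholesky(C_x).T          # upper-triangular, C_x = Q_x.T @ Q_x

assert np.allclose(Q_x.T @ Q_x, C_x)
```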
2. Prior Work
Adversarial Representation Learning:
In the context of image classification, adversarial learning has been utilized to obtain representations that are invariant across domains [8, 9, 28]. Such representations allow classifiers that are trained on a source domain to generalize to a different target domain. In the context of learning fair and unbiased representations, a number of approaches [7, 31, 3, 29, 23, 26, 1] have used and argued [21] for explicit adversarial networks to extract sensitive attributes from the encoded data. With the exception of [26], all the other methods are set up as a minimax game between the encoder, a target task and the adversary. The encoder is set up to achieve fairness by maximizing the loss of the adversary, i.e., minimizing the negative log-likelihood of the sensitive variables as measured by the adversary. Roy et al. [26] identify and address the instability in the optimization of the zero-sum minimax formulation of ARL and propose an alternate non-zero-sum solution, demonstrating significantly improved empirical performance. All the above approaches use deep neural networks to represent the ARL entities, optimize their parameters through simultaneous stochastic gradient descent, and rely on empirical validation. However, none of them seek to study the nature of the ARL formulation itself, i.e., in terms of decoupling the role of the expressiveness of the models and the convergence/stability properties of the optimization tools used to learn the parameters of said models. Therefore, we seek to bridge this gap by studying simpler forms of ARL from a global optimization perspective.

Privacy, Fairness and Invariance:
Concurrent work on learning fair or invariant representations of data includes an encoder and a target predictor but does not involve an explicit adversary. The role of the adversary is instead played by an explicit, hand-designed objective that typically competes with that of the target task. The concept of learning fair representations was first introduced by Zemel et al. [30]. The goal was to learn a representation of data by "fair clustering" while maintaining the discriminative features of the prediction task. Building upon this work, many techniques have been proposed to learn an unbiased representation of data while retaining its effectiveness for a prediction task. These include the Variational Fair Autoencoder [20] and the more recent information bottleneck based objective of Moyer et al. [24]. As with the ARL methods above, these approaches rely on empirical validation. None of them study their respective non-convex objectives from an optimization perspective, nor do they provide any provable guarantees on the achievable trade-off between fairness and utility. The competing nature of the objectives considered in this body of work bears resemblance to the non-convex objectives that we study in this paper. Though it is not our focus, the approach presented here could potentially be extended to analyze the aforementioned methods.
Optimization Theory for Adversarial Learning:
The non-convex nature of the ARL formulation poses unique challenges from an optimization perspective. Practically, the parameters of the models in ARL are optimized through stochastic gradient descent, either jointly [7, 22] or alternatively [8], with the former being a generalization of gradient descent. While the convergence properties of gradient descent and its variants are well understood, there is relatively little work on the convergence and stability of simultaneous gradient descent in adversarial minimax problems. Recently, Mescheder et al. [22] and Nagarajan et al. [25] both leveraged tools from non-linear systems theory [13] to analyze the convergence properties of simultaneous gradient descent, in the context of GANs, around a given equilibrium. They show that without the introduction of additional regularization terms to the objective of the zero-sum game, simultaneous gradient descent does not converge. However, their analysis is restricted to the two-player GAN setting and is not concerned with its global optima.

In the context of fair representation learning, Komiyama et al. [16] consider the problem of enforcing fairness constraints in linear regression and provide a solution to obtain the global optima of the resulting non-convex problem. While we derive inspiration from this work, our problem setting and technical solution are both notably different. Specifically, their approach does not involve, (1) an explicit adversary as a measure of sensitive information in the representation, and (2) an encoder tasked with disentangling and discarding the sensitive information in the data.
3. Adversarial Representation Learning
Let the data matrix X = [x_1, ..., x_n] ∈ R^{d×n} be n realizations of d-dimensional data x ∈ R^d. Assume that x is associated with a sensitive attribute s ∈ R^q and a target attribute y ∈ R^p. We denote the n realizations of the sensitive and target attributes as S = [s_1, ..., s_n] and Y = [y_1, ..., y_n], respectively. Treating the attributes as vectors enables us to consider both multi-class classification and regression under the same setup.

The adversarial representation learning problem is formulated with the goal of learning the parameters of an embedding function E(·; Θ_E): x ↦ z with two objectives: (i) aiding a target predictor T(·; Θ_y) to accurately infer the target attribute y from z, and (ii) preventing an adversary A(·; Θ_s) from inferring the sensitive attribute s from z. The ARL problem can be formulated as,

min_{Θ_E} min_{Θ_y} L_y(T(E(x; Θ_E); Θ_y), y)  s.t.  min_{Θ_s} L_s(A(E(x; Θ_E); Θ_s), s) ≥ α   (2)

where L_y and L_s are the loss functions (averaged over the training dataset) for the target predictor and the adversary, respectively, α ∈ [0, ∞) is a user defined value that determines the minimum tolerable loss for the adversary on the sensitive attribute, and the minimization in the constraint is equivalent to the encoder operating against an optimal adversary. Existing instances of this problem adopt deep neural networks to represent E, T and A and learn their respective parameters {Θ_E, Θ_y, Θ_s} through simultaneous SGD.

We first consider the simplest form of the ARL problem and analyze it from an optimization perspective. We model both the adversary and the target predictor as linear regressors,

ŷ = Θ_y z + b_y,    ŝ = Θ_s z + b_s   (3)

where z is an encoded version of x, and ŷ and ŝ are the predictions corresponding to the target and sensitive attributes. We also model the encoder through a linear mapping,

Θ_E ∈ R^{r×d}: x ↦ z = Θ_E x   (4)

where r < d is the dimensionality of the projected space (when r is equal to d, the encoder is unable to guard against an adversary who can simply learn to invert Θ_E). While existing DNN based solutions select r on an ad-hoc basis, our approach determines r as part of the solution to the ARL problem. For both the adversary and the target predictor, we adopt the mean squared error (MSE) to assess the quality of their respective predictions, i.e., L_y(y, ŷ) = E[‖y − ŷ‖²] and L_s(s, ŝ) = E[‖s − ŝ‖²].

For any given encoder Θ_E, the following lemma gives the minimum MSE for a linear regressor in terms of the covariance matrices and Θ_E. The lemma assumes that x is zero-mean and the covariance matrix C_x is positive definite. These assumptions are not restrictive since we can always remove the mean and the dependent features from x. We defer the proofs of all lemmas and theorems to the appendix.

Lemma 1.
Let x and t be two random vectors with E[x] = 0, E[t] = b, and C_x ≻ 0. Consider a linear regressor, t̂ = Wz + b, where W ∈ R^{m×r} is the parameter matrix, and z ∈ R^r is an encoded version of x for a given Θ_E: x ↦ z = Θ_E x, Θ_E ∈ R^{r×d}. The minimum MSE that can be achieved by designing W is,

min_W E[‖t − t̂‖²] = Tr[C_t] − ‖P_M Q_x^{-T} C_xt‖_F²

where M = Q_x Θ_E^T ∈ R^{d×r}, and Q_x ∈ R^{d×d} is a Cholesky factor of C_x as shown in (1).

Applying this result to the target and adversary regressors, we obtain their minimum MSEs,

J_y(Θ_E) = min_{Θ_y} L_y(T(E(x; Θ_E); Θ_y), y) = Tr[C_y] − ‖P_M Q_x^{-T} C_xy‖_F²   (5)

J_s(Θ_E) = min_{Θ_s} L_s(A(E(x; Θ_E); Θ_s), s) = Tr[C_s] − ‖P_M Q_x^{-T} C_xs‖_F²   (6)

Given the encoder, J_y(Θ_E) is related to the performance of the target predictor, whereas J_s(Θ_E) corresponds to the amount of sensitive information that an adversary is able to leak. Note that the linear models for T and A enable us to obtain their respective optimal solutions for a given encoder Θ_E. On the other hand, when T and A are modeled as DNNs, doing the same is analytically infeasible and potentially impractical.

The orthogonal projector P_M in Lemma 1 is a function of two factors, a data dependent term Q_x and the encoder parameters Θ_E. While the former is fixed for a given dataset, the latter is our object of interest. Pursuantly, we decompose P_M in order to separably characterize the effect of these two factors. Let the columns of L_x ∈ R^{d×d} be an orthonormal basis for the column space of Q_x. Due to the bijection G_E = L_x^{-1} Q_x Θ_E^T ⇔ Θ_E = G_E^T L_x^T Q_x^{-T}, which follows from L_x G_E = Q_x Θ_E^T, determining the encoder parameters Θ_E is equivalent to determining G_E. The projector P_M can now be expressed in terms of P_G, which depends only on the free parameter G_E:

P_M = M (M^T M)^† M^T = L_x P_G L_x^T   (7)

where we used the equality M = Q_x Θ_E^T and the fact that L_x^T L_x = I.

Now, we turn back to the ARL setup and see how the above decomposition can be leveraged. The optimization problem in (2) reduces to,

min_{G_E} J_y(G_E)  s.t.  J_s(G_E) ≥ α   (8)

where the minimum MSE measures of (5) and (6) are now expressed in terms of G_E instead of Θ_E.

Before solving this optimization problem, we first interpret it geometrically. Consider a simple example where x is a white random vector, i.e., C_x = I. Under this setting, Q_x = L_x = I and G_E = Θ_E^T. As a result, the optimization problem in (8) can alternatively be solved in terms of G_E = Θ_E^T, with J_y(G_E) = Tr[C_y] − ‖P_G C_xy‖_F² and J_s(G_E) = Tr[C_s] − ‖P_G C_xs‖_F².

The constraint J_s(G_E) ≥ α implies ‖P_G C_xs‖_F² ≤ Tr[C_s] − α, which is geometrically equivalent to the subspace G being outside (or tangent to) the cone around C_xs. Similarly, minimizing J_y(G_E) implies maximizing ‖P_G C_xy‖_F², which in turn is equivalent to minimizing the angle between the subspace G and the vector C_xy.
Therefore, the global optima of (8) is any hyperplane G which is outside the cone around C_xs while subtending the smallest angle to C_xy. An illustration of this setting and its solution is shown in Figure 3 for d = 3, r = 2 and p = q = 1.

Figure 3: Geometric Interpretation: An illustration of a three-dimensional input space x and one-dimensional target and adversary regressors. Therefore, both C_xs and C_xy are one-dimensional. We locate the y-axis in the same direction as C_xs. The feasible space for the solution G_E = Θ_E^T imposed by the constraint J_s(Θ_E) ≥ α corresponds to the region outside the cone (specified by C_s and α) around C_xs. The non-convexity of the problem stems from the non-convexity of this feasible set. The objective min J_y(Θ_E) corresponds to minimizing the angle between the line C_xy and the plane G. When C_xy is outside the cone, the line C_xy itself, or any plane that contains the line C_xy and does not intersect the cone, is a valid solution. When C_xy is inside the cone, the solution is either a line or, as we illustrate, a tangent hyperplane to the cone that is closest to C_xy. The non-differentiability stems from the fact that the solution can either be a plane or a line.

Constrained optimization problems such as (8) are commonly solved through their respective unconstrained Lagrangian formulations [2], as shown below:

min_{G_E ∈ R^{d×r}} { (1 − λ) J_y(G_E) − λ J_s(G_E) }   (9)

for some parameter 0 ≤ λ ≤ 1. Such an approach affords two main advantages and has one disadvantage. (a) A direct and closed-form solution can be obtained. (b) Framing (9) in terms of λ and (1 − λ) allows explicit control between the two extremes of no privacy (λ = 0) and no target (λ = 1). As a consequence, it can be shown that for every λ ∈ [0, 1] there exists an α ∈ [α_min, α_max] (see Section A.2 of the appendix for a proof). In practice, given a user specified value α_min ≤ α_tol ≤ α_max, we can solve (8) by iterating over λ ∈ [0, 1] until the solution of (9) yields the specified α_tol. (c) The converse, on the other hand, does not necessarily hold, i.e., for a given tolerable loss α there may not be a corresponding λ ∈ [0, 1]. This is the theoretical limitation of solving the Lagrangian problem instead of the constrained problem. (Practically, as we show in Figures 6 and 10, all values of α ∈ [α_min, α_max] appear to be reachable as we sweep through λ ∈ [0, 1].)

Before we obtain the solution to the Lagrangian formulation (9), we characterize the nature of the optimization problem in the following theorem.
Theorem 2. As a function of G_E ∈ R^{d×r}, the objective function in (9) is neither convex nor differentiable.

Despite the difficulty associated with the objective in (9), we derive a closed-form solution for its global optima. Our key insight lies in partitioning the search space R^{d×r} based on the rank of the matrix G_E. For a given rank i, let S_i be the set containing all matrices G_E of rank i,

S_i = { G_E ∈ R^{d×r} | rank(G_E) = i },  i = 0, 1, ..., r.

Obviously, ∪_{i=0}^{r} S_i = R^{d×r}. As a result, the optimization problem in (9) can be solved by considering r minimization problems, one for each possible rank of G_E:

min_{i ∈ {1,...,r}} { min_{G_E ∈ S_i} (1 − λ) J_y(G_E) − λ J_s(G_E) }   (10)

We observe from (5), (6) and (7) that the optimization problem in (9) depends only on the subspace G. As such, the solution G_E is not unique, since many different matrices can span the same subspace. Hence, it is sufficient to solve for any particular G_E that spans the optimal subspace G. Without loss of generality, we seek an orthonormal basis spanning the optimal subspace G as our desired solution. We constrain G_E ∈ R^{d×i} to be an orthonormal matrix, i.e., G_E^T G_E = I_i, where i is the dimensionality of G. Ignoring the constant terms in J_y and J_s, for each i = 1, ..., r, the minimization problem over S_i in (10) reduces to,

min_{G_E^T G_E = I_i} J_λ(G_E)   (11)

where J_λ(G_E) = λ ‖L_x G_E G_E^T L_x^T Q_x^{-T} C_xs‖_F² − (1 − λ) ‖L_x G_E G_E^T L_x^T Q_x^{-T} C_xy‖_F². From basic properties of the trace, we have J_λ(G_E) = Tr[G_E^T B G_E], where B ∈ R^{d×d} is a symmetric matrix:

B = L_x^T Q_x^{-T} ( λ C_sx^T C_sx − (1 − λ) C_yx^T C_yx ) Q_x^{-1} L_x   (12)

The optimization problem in (11) is equivalent to trace minimization on a Stiefel manifold, which has closed-form solution(s) (see [15] and [6]). In view of the above discussion, the solution to the optimization problem in (9), or equivalently (10), can be stated in the next theorem.

Theorem 3. Assume that the number of negative eigenvalues of B in (12) is j. Denote γ = min{r, j}. Then, the minimum value in (10) is given as,

β_1 + β_2 + ··· + β_γ   (13)

where β_1 ≤ β_2 ≤ ... ≤ β_γ < 0 are the γ smallest eigenvalues of B. The minimum can be attained by G_E = V, where the columns of V are the eigenvectors corresponding to all γ negative eigenvalues of B.

Note that including the eigenvectors corresponding to zero eigenvalues of B in the solution G_E of Theorem 3 does not change the minimum value in (13). However, considering only the negative eigenvectors results in a G_E with the least rank, and thereby an encoder that is less likely to contain sensitive information for an adversary to exploit. Once G_E is constructed, we can obtain our desired encoder as Θ_E = G_E^T L_x^T Q_x^{-T}. Recall that the solution in Theorem 3 is under the assumption that the covariance C_x is a full-rank matrix. In Section B of the appendix, we develop a solution for the more practical and general case where empirical moments are used instead.
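To make the procedure concrete, the following NumPy sketch traces the closed-form recipe of Theorem 3 for the linear case: build B as in (12), keep the eigenvectors with negative eigenvalues, and map back to Θ_E = G_E^T L_x^T Q_x^{-T}. It is only an illustrative implementation under the full-rank covariance assumption; the variable names are ours and the released code may differ.

```python
import numpy as np

def linear_sarl(X, Y, S, lam):
    """Global optimum of (9) for the linear ARL setting.
    X: (d, n) data, Y: (p, n) targets, S: (q, n) sensitive labels, lam in [0, 1]."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)          # center the data
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sc = S - S.mean(axis=1, keepdims=True)

    C_x  = Xc @ Xc.T / n                             # covariance (assumed full rank)
    C_yx = Yc @ Xc.T / n                             # cross-covariances
    C_sx = Sc @ Xc.T / n

    Q_x = np.linalg.cholesky(C_x).T                  # C_x = Q_x^T Q_x
    L_x, _ = np.linalg.qr(Q_x)                       # orthonormal basis of R(Q_x)
    Q_inv = np.linalg.inv(Q_x)

    A_s = C_sx @ Q_inv @ L_x                         # so A_s^T A_s = L_x^T Q_x^{-T} C_sx^T C_sx Q_x^{-1} L_x
    A_y = C_yx @ Q_inv @ L_x
    B = lam * A_s.T @ A_s - (1.0 - lam) * A_y.T @ A_y    # eq. (12)
    B = (B + B.T) / 2                                # symmetrize for numerical safety

    evals, evecs = np.linalg.eigh(B)                 # ascending eigenvalues
    G_E = evecs[:, evals < 0]                        # eigenvectors of negative eigenvalues
    Theta_E = G_E.T @ L_x.T @ Q_inv.T                # encoder, shape (r, d)
    return Theta_E
```

The number of negative eigenvalues of B (capped at r) is also the optimal embedding dimensionality, which is how SARL selects r automatically rather than on an ad-hoc basis.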
We extend the "linear" version of the ARL problem studied thus far to a "non-linear" version through kernelization. We model the encoder as a linear function over a non-linear mapping of the inputs, as illustrated in Figure 2. Let the data matrix X be mapped non-linearly by a possibly unknown and infinite dimensional function φ_x(·) to Φ_x, and let the corresponding reproducing kernel function be k_x(·,·). The centered kernel matrix can be obtained as,

K̃_x = Φ̃_x^T Φ̃_x = D^T Φ_x^T Φ_x D = D^T K_x D   (14)

where K_x is the kernel matrix on the original data X.

If the co-domain of φ_x(·) is infinite dimensional (e.g., an RBF kernel), then the encoder in (4) would also be infinite dimensional, i.e., Θ_E ∈ R^{r×∞}, which is infeasible to learn directly. However, the representer theorem [27] allows us to construct the encoder as a linear function of Φ̃_x^T, i.e., Θ_E = Λ Φ̃_x^T = Λ D^T Φ_x^T. Hence, a data sample x can be mapped through the "kernel trick" as,

Θ_E φ_x(x) = Λ D^T Φ_x^T φ_x(x) = Λ D^T [k_x(x_1, x), ..., k_x(x_n, x)]^T   (15)

Hence, designing Θ_E is equivalent to designing Λ ∈ R^{r×n}. The Lagrangian formulation of this Kernel-ARL setup and its solution share the same form as that of the linear case (9). The objective function remains non-convex and non-differentiable, while the matrix B now depends on the kernel matrix K_x instead of the covariance matrix C_x (see Section C of the appendix for details):

B = L_x^T ( λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ ) L_x   (16)

where the columns of L_x are an orthonormal basis for K̃_x. Once G_E is obtained through the eigendecomposition of B, we can find Λ as Λ = G_E^T L_x^T K̃_x^†. This non-linear extension in the form of kernelization serves to study the ARL problem in a setting where the encoder possesses greater representational capacity while still admitting the global optima and bounds on the objectives of the target predictor and the adversary, as we show next. Algorithm 1 provides a detailed procedure for solving both the Linear-ARL and Kernel-ARL formulations.
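A matching sketch of the kernel variant (again ours, not the reference implementation): build the centered kernel matrix (14), form B as in (16), take the negative-eigenvalue eigenvectors, recover Λ, and embed new samples with the kernel trick (15). The RBF kernel and its bandwidth below are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2); A is (d, n1), B is (d, n2)."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def kernel_sarl(X, Y, S, lam, gamma=1.0):
    """X: (d, n) data, Y: (p, n) targets, S: (q, n) sensitive labels, lam in [0, 1]."""
    n = X.shape[1]
    D = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    K = rbf_kernel(X, X, gamma)                      # kernel matrix K_x
    K_c = D.T @ K @ D                                # centered kernel matrix, eq. (14)

    # Orthonormal basis L_x for the column space of the centered kernel matrix
    U, sv, _ = np.linalg.svd(K_c)
    L_x = U[:, sv > 1e-10 * sv.max()]

    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sc = S - S.mean(axis=1, keepdims=True)
    B = L_x.T @ (lam * Sc.T @ Sc - (1 - lam) * Yc.T @ Yc) @ L_x   # eq. (16)
    B = (B + B.T) / 2

    evals, evecs = np.linalg.eigh(B)
    G_E = evecs[:, evals < 0]
    Lam = G_E.T @ L_x.T @ np.linalg.pinv(K_c)        # Lambda = G_E^T L_x^T K_c^dagger

    def encode(X_new):
        """Embed new samples via the kernel trick, eq. (15)."""
        k = rbf_kernel(X, X_new, gamma)              # column j holds k_x(x_i, x_new_j)
        return Lam @ D.T @ k                         # (r, n_new) embedding
    return Lam, encode
```

For large n, the eigendecomposition of the n × n matrix B becomes the bottleneck, which is what motivates the Nyström approximation mentioned in Section 5.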
4. Analytical Bounds
In this section we introduce bounds on the utility and invariance of the representation learned by SARL. We define four bounds: α_min, α_max, γ_min and γ_max.

γ_min: A lower bound on the minimum achievable target loss, or equivalently an upper bound on the best achievable target performance. This bound can be expressed as the minimum target MSE across all possible encoders Θ_E and is attained at λ = 0:

γ_min = min_{Θ_E} J_y(Θ_E)

α_max: An upper bound on the maximum achievable adversary loss, or equivalently a lower bound on the minimum leakage of the sensitive attribute. This bound can be expressed as the maximum adversary MSE across all possible encoders Θ_E and is attained at λ = 1:

α_max = max_{Θ_E} J_s(Θ_E)

γ_max: An upper bound on the maximum achievable target loss, or equivalently a lower bound on the minimum achievable target performance. This bound corresponds to the scenario where the encoder is constrained to maximally hinder the adversary; in all other cases one can obtain higher target performance by choosing a better encoder. This bound is attained in the limit λ → 1 and can be expressed as,

γ_max = min_{Θ_E ∈ arg max J_s(Θ_E)} J_y(Θ_E)

α_min: A lower bound on the minimum achievable adversary loss, or equivalently an upper bound on the maximum leakage of the sensitive attribute. The absolute lower bound is obtained in the scenario where the encoder is neither constrained to aid the target nor to hinder the adversary, i.e.,

α*_min = min_{Θ_E} J_s(Θ_E)

However, this is an unrealistic scenario since in the ARL problem, by definition, the encoder is explicitly designed to aid the target. Therefore, a more realistic lower bound can be defined under the constraint that the encoder maximally aids the target, i.e.,

ᾱ_min = min_{Θ_E ∈ arg min J_y(Θ_E)} J_s(Θ_E)

However, even this bound is not realistic, since among all the encoders that aid the target one can always choose the encoder that minimizes the leakage of the sensitive attribute. The bound corresponding to such an encoder can be expressed as,

α_min = max_{Θ_E ∈ arg min J_y(Θ_E)} J_s(Θ_E)

This bound is attained in the limit λ → 0. It is easy to see that these bounds are ordinally related as,

α*_min ≤ ᾱ_min ≤ α_min

To summarize, in each of these cases there exists an encoder that achieves the respective bound. Therefore, given a choice, the encoder that corresponds to α_min is the most desirable. The following lemma gives the closed-form expressions of these bounds as a function of the data.

Lemma 4.
Let the columns of L_x be an orthonormal basis for K̃_x (in the linear case, K̃_x = X̃^T X̃). Further, assume that the columns of V_s are the singular vectors corresponding to the zero singular values of S̃ L_x and that the columns of V_y are the singular vectors corresponding to the non-zero singular values of Ỹ L_x. Then, we have

γ_min = min_{Θ_E} J_y(Θ_E) = (1/n) ‖Ỹ^T‖_F² − (1/n) ‖Ỹ L_x‖_F²

γ_max = min_{Θ_E ∈ arg max J_s(Θ_E)} J_y(Θ_E) = (1/n) ‖Ỹ^T‖_F² − (1/n) ‖Ỹ L_x V_s‖_F²

α_min = max_{Θ_E ∈ arg min J_y(Θ_E)} J_s(Θ_E) = (1/n) ‖S̃^T‖_F² − (1/n) ‖S̃ L_x V_y‖_F²

α_max = max_{Θ_E} J_s(Θ_E) = (1/n) ‖S̃^T‖_F²

Under the special case of one-dimensional data, i.e., when x, y and s are scalars, the above bounds can be related to the normalized correlations of the variables involved. Specifically, the normalized bounds γ_min and α_min can be expressed as,

γ_min / σ_y² = 1 − ρ²(x, y),    α_min / σ_s² = 1 − ρ²(x, s)

where ρ(·,·) denotes the correlation coefficient (i.e., normalized correlation) between two random variables and σ_y² = E[ỹ²] (σ_s² is defined similarly). Similarly, the upper bounds γ_max and α_max can be expressed in terms of the variance of the label space as,

γ_max / σ_y² = α_max / σ_s² = 1

Therefore, in the one-dimensional setting, the achievable bounds are related to the underlying alignment between the subspace spanned by the data X and the respective subspaces spanned by the labels S and Y.
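The closed-form expressions in Lemma 4 are straightforward to evaluate from data. A small NumPy sketch for the linear case is given below (our own illustrative code, using an orthonormal basis of X̃^T for L_x).

```python
import numpy as np

def _split_right_singular(A, tol=1e-10):
    """Return right singular vectors of A split into (non-zero, zero) singular values."""
    _, sv, Vt = np.linalg.svd(A)                     # full SVD, Vt is (m, m)
    nnz = int(np.sum(sv > tol * max(sv.max(), tol)))
    V = Vt.T
    return V[:, :nnz], V[:, nnz:]

def sarl_bounds(X, Y, S):
    """Empirical bounds of Lemma 4 (linear case).
    X: (d, n), Y: (p, n), S: (q, n); returns (gamma_min, gamma_max, alpha_min, alpha_max)."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sc = S - S.mean(axis=1, keepdims=True)

    # Orthonormal basis L_x for the column space of X_tilde^T
    U, sv, _ = np.linalg.svd(Xc.T, full_matrices=False)
    L_x = U[:, sv > 1e-10 * sv.max()]                # shape (n, rank)

    _, V_s = _split_right_singular(Sc @ L_x)         # zero singular vectors of S_tilde L_x
    V_y, _ = _split_right_singular(Yc @ L_x)         # non-zero singular vectors of Y_tilde L_x

    gamma_min = (np.linalg.norm(Yc) ** 2 - np.linalg.norm(Yc @ L_x) ** 2) / n
    gamma_max = (np.linalg.norm(Yc) ** 2 - np.linalg.norm(Yc @ L_x @ V_s) ** 2) / n
    alpha_min = (np.linalg.norm(Sc) ** 2 - np.linalg.norm(Sc @ L_x @ V_y) ** 2) / n
    alpha_max = np.linalg.norm(Sc) ** 2 / n
    return gamma_min, gamma_max, alpha_min, alpha_max
```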
5. Computational Complexity
In the case of Linear-SARL, calculating the covariance matrices C_x, C_yx and C_sx requires O(d²n), O(pdn) and O(qdn) multiplications, respectively. Next, the complexity of the Cholesky factorization C_x = Q_x^T Q_x and of calculating its inverse Q_x^{-1} is O(d³) each. Finally, solving the optimization problem has a complexity of O(d³) to eigendecompose the d × d matrix B. In the case of Kernel-SARL, the eigendecomposition of B requires O(n³) operations. However, for scalability, i.e., large n (e.g., CIFAR-100), the Nyström method with data sampling [18] can be adopted. To summarize, the complexity of the linear and kernel formulations is O(d³) and O(n³), respectively.
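As a rough illustration of the scalability route mentioned above, a basic Nyström approximation replaces the n × n kernel matrix with a low-rank factor built from m ≪ n sampled landmark points. The sketch below is a generic textbook version (uniform sampling and the choice of m are our assumptions), not the specific variant of [18].

```python
import numpy as np

def nystrom_features(X, kernel_fn, m=500, seed=0):
    """Approximate K = kernel_fn(X, X) by Phi Phi^T using m landmark columns.
    X: (d, n); returns Phi of shape (n, m) with K approximately Phi @ Phi.T."""
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(m, n), replace=False)   # uniform landmark sampling

    K_nm = kernel_fn(X, X[:, idx])                       # (n, m) cross-kernel block
    K_mm = kernel_fn(X[:, idx], X[:, idx])               # (m, m) landmark kernel block

    # K approx K_nm K_mm^dagger K_nm^T = Phi Phi^T with Phi = K_nm K_mm^{-1/2}
    evals, evecs = np.linalg.eigh(K_mm)
    evals = np.clip(evals, 1e-12, None)                  # guard against tiny negative eigenvalues
    K_mm_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return K_nm @ K_mm_inv_sqrt
```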
6. Numerical Experiments
We evaluate the efficacy of the proposed Spectral-ARL (SARL) algorithm in finding the global optima and compare it with other ARL baselines that are based on standard simultaneous SGD optimization (henceforth referred to as SSGD). In all experiments, we refer to our solution for "linear" ARL as Linear-SARL, and to the solution with a "kernel" encoder and linear classifiers for the predictor and adversary as Kernel-SARL.
We first consider a simple example in order to visualize and compare the learned embeddings from different ARL solutions. We consider a three-dimensional problem where each data sample has two attributes, color and shape. Specifically, the input data X is generated from a mixture of four Gaussian distributions corresponding to the four possible combinations of the two attributes, with distinct means µ_1, µ_2, µ_3, µ_4 and identical diagonal covariance matrices Σ. The shape attribute is our target while color is the sensitive attribute, as illustrated in Figure 4. The goal of the ARL problem is to learn an encoder that projects the data such that it remains separable with respect to the shape attribute and non-separable with respect to the color attribute.

Figure 4: Samples from a mixture of four Gaussians. Each sample has two attributes, shape and color.
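For readers who want to reproduce a qualitatively similar toy setup, the following sketch draws such a four-component mixture. The specific means, covariance and class-to-attribute assignment here are placeholders chosen by us; the exact values used in the paper are not recoverable from this text.

```python
import numpy as np

def sample_two_attribute_mixture(n_per_component=1000, seed=0):
    """Toy data: 4 Gaussians, each tagged with a (shape, color) attribute pair."""
    rng = np.random.default_rng(seed)
    # Hypothetical means and a shared diagonal covariance (placeholders, not the paper's values)
    means = np.array([[1.0, 1.0, 0.0],
                      [2.0, 2.0, 0.0],
                      [2.0, 2.5, 0.0],
                      [2.5, 3.0, 0.0]])
    cov = np.diag([0.1, 0.1, 0.1])
    shapes = np.array([0, 0, 1, 1])                  # target attribute per component
    colors = np.array([0, 1, 0, 1])                  # sensitive attribute per component

    X, y, s = [], [], []
    for k in range(4):
        X.append(rng.multivariate_normal(means[k], cov, size=n_per_component))
        y.append(np.full(n_per_component, shapes[k]))
        s.append(np.full(n_per_component, colors[k]))
    return np.concatenate(X).T, np.concatenate(y), np.concatenate(s)   # X is (3, N)
```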
Figure 5: Gaussian Mixture: Trade-off between target performance and leakage of the sensitive attribute to the adversary.
Figure 6: Gaussian Mixture: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set. The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively. (a) Linear-SARL, (b) Kernel-SARL.

We sample 4000 points to learn linear and non-linear (Gaussian kernel) encoders across λ ∈ [0, 1]. To train the encoder, the one-hot encodings of the target and sensitive labels are treated as the regression targets. We then freeze the encoder and train logistic regressors for the adversary and the target task for each λ, and evaluate their classification performance on a separate set of 1000 samples. The resulting trade-off front between target and adversary performance is shown in Figure 5. We make the following observations. (1) For λ = 1, all methods achieve an adversary accuracy of 50% (random chance), which indicates complete removal of the features corresponding to the sensitive attribute via our encoding. (2) At small values of λ, the objective of Linear-ARL is close to being convex, hence the similarity of the trade-off fronts of Linear-SARL and SSGD in that region. Everywhere else, however, due to its iterative nature, SSGD is unable to find the global solution and achieve the same trade-off as Linear-SARL. (3) The non-linear encoder in the Kernel-SARL solution significantly outperforms both Linear-SARL and SSGD. The non-linear nature of the encoder enables it to strongly entangle the color attribute (50% accuracy) while simultaneously achieving a higher target accuracy than the linear encoder. Figure 7 visualizes the learned embedding space z for different trade-offs between the target and adversary objectives.

Figure 6 shows the mean squared error (MSE) of the adversary as we vary the relative trade-off λ between the target and adversary objectives. The plot illustrates, (1) the lower and upper bounds α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1]. This is, however, not the case on the test set, as the bounds are computed through empirical moments as opposed to the true covariance matrices.

We next consider the task of learning representations that are invariant to a sensitive attribute on two datasets, Adult and German, from the UCI ML repository [5]. For comparison, apart from the raw features X, we consider several baselines that use DNNs and are trained through simultaneous SGD: LFR [30], VAE [14], VFAE [20], ML-ARL [29] and MaxEnt-ARL [26].

The Adult dataset contains 14 attributes and is split into training and test sets. The target task is binary classification of annual income, i.e., more or less than $50K, and the sensitive attribute is gender. Similarly, the German dataset contains instances of individuals, each described by a set of attributes. The target is to classify the credit of individuals as good or bad, with the sensitive attribute being age.
Figure 7: Gaussian Mixture: The optimal dimensionality of the embedding z is 1. Visualization of the embedding histogram w.r.t. each attribute for different relative emphasis λ on the target (shape) and sensitive (color) attributes. The top row is color and the bottom row is shape. The first three columns show results for a linear encoder: at λ = 0 the weight on the adversary is 0, so color is still separable, and as λ increases the colors become less and less separable. The last column shows results for a kernel encoder, for which the target attribute is quite separable while the sensitive attribute is entangled.

Table 1: Fair Classification Performance (in %)

                          Adult Dataset                  German Dataset
Method               Target    Sensitive   ∆*       Target    Sensitive   ∆*
                     (income)  (gender)             (credit)  (age)
Raw Data              85.0      85.0      17.6       80.0      87.0       6.0
LFR [30]              82.3      67.0       0.4       72.3      80.5       0.5
VAE [14]              81.9      66.0       1.4       72.5      79.5       1.5
VFAE [20]             81.3      67.0       0.4       72.7      79.7       1.3
ML-ARL [29]           84.4      67.7       0.3       74.4      80.2       0.8
MaxEnt-ARL [26]       84.6      65.5       1.9       72.5      80.0       1.0
Linear-SARL           84.1      67.4       0.0       76.3      80.9       0.1
Kernel-SARL           84.1      67.4       0.0       76.3      80.9       0.1
* Absolute difference between adversary accuracy and random guess

We learn encoders on the training set, after which, following the baselines, we freeze the encoder and train the target (logistic regression) and adversary (2-layer network with 64 units) classifiers on the training set. Table 1 shows the performance of the target and adversary on both datasets. Both Linear-SARL and Kernel-SARL outperform all DNN based baselines. For either of these tasks, Kernel-SARL does not afford any additional benefit over Linear-SARL. For the Adult dataset, the linear encoder maps the 14 input features to just one dimension. The weight assigned to each feature is shown in Figure 8. Notice that the encoder assigns almost zero weight to the gender feature in order to be fair with respect to the gender attribute.
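The evaluation protocol described above (freeze the SARL encoder, then fit downstream classifiers) can be mimicked with scikit-learn. This is a hedged sketch of our own; in particular, the 2-hidden-layer, 64-unit MLP for the adversary is an assumption, since the exact architecture and training details are not fully specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def evaluate_encoder(Theta_E, X_tr, y_tr, s_tr, X_te, y_te, s_te):
    """Freeze the linear encoder Theta_E (r, d) and train/evaluate downstream classifiers.
    X_* are (d, n); y_*, s_* are 1-D label arrays."""
    Z_tr, Z_te = (Theta_E @ X_tr).T, (Theta_E @ X_te).T   # embeddings, shape (n, r)

    target = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    adversary = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000).fit(Z_tr, s_tr)

    return {"target_acc": target.score(Z_te, y_te),
            "adversary_acc": adversary.score(Z_te, s_te)}
```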
Figure 8: Adult Dataset: Magnitude of the learned encoder weights Θ_E for each semantic input feature (age, workclass, final weight, education, education num, marital status, occupation, relationship, race, gender, capital gain, capital loss, hours/week, country).

The next task pertains to face classification under different illumination conditions on the Extended Yale B dataset [10]. It comprises face images of 38 people under five different light source directions, namely upper right, lower right, lower left, upper left, and front. The target task is to establish the identity of the person in the image, with the direction of the light being the sensitive attribute. Since the direction of lighting is independent of identity, the ideal ARL solution should obtain a representation z that is devoid of any sensitive information. We first followed the experimental setup of Xie et al. [29] in terms of the train/test split strategy, i.e., 190 samples (5 from each class) for training and 1096 images for testing. Our global solution was able to completely remove illumination from the embedding, resulting in an adversary accuracy of 20%, i.e., random chance. To investigate further, we consider different variations of this problem, flipping the target and sensitive attributes and exchanging the training and test sets. The complete set of results, including the DNN based baselines, is reported in Table 2 ([EX] corresponds to exchanging the training and testing sets). In all these cases, our solution was able to completely remove the sensitive features, resulting in adversary performance that is no better than random chance. Simultaneously, the embedding is also competitive with the baselines on the target task.

Table 2: Extended Yale B Performance (in %)

Method                Adversary        Target       Adversary     Target
                      (illumination)   (identity)   (identity)    (illumination)
Raw Data                  96              78            -             -
VFAE [20]                 57              85            -             -
ML-ARL [29]               57              89            -             -
MaxEnt-ARL [26]           40              89            -             -
Linear-SARL               21              81            3             94
Linear-SARL [EX]          20              86            3             97
Kernel-SARL               20              86            3             96
Kernel-SARL [EX]          20              88            3             96

The CIFAR-100 dataset [17] consists of 50,000 images from 100 classes that are further grouped into 20 superclasses. Each image is therefore associated with two attributes, a "fine" class label and a "coarse" superclass label. We consider a setup where the "coarse" and "fine" labels are the target and sensitive attributes, respectively. For Linear-SARL, Kernel-SARL (degree-five polynomial kernel) and SSGD, we use features (64-dimensional) extracted from a pre-trained ResNet-110 model as input to the encoder, instead of raw images. From these features, the encoder is tasked with aiding the target predictor and hindering the adversary. This setup serves as an example of how invariance can be "imparted" to an existing biased pre-trained representation. We also consider two DNN baselines, ML-ARL [29] and MaxEnt-ARL [26]. Unlike our scenario, where the pre-trained layers of the ResNet are not adapted, the baselines optimize the entire encoder for the ARL task. For evaluation, once the encoder is learned and frozen, we train a discriminator and adversary as 2-layer networks with 64 neurons each. Therefore, although our approach uses a linear regressor as the adversary during training, we evaluate against stronger adversaries at test time. In contrast, the baselines train and evaluate against adversaries of equal capacity.
Figure 9: CIFAR-100: Trade-off between target performance and leakage of the sensitive attribute to the adversary.

Figure 9 shows the trade-off in accuracy between the target predictor and the adversary. We observe that, (1) Kernel-SARL significantly outperforms Linear-SARL. Since the former implicitly maps the data into a higher dimensional space, the sensitive features are potentially disentangled sufficiently for the linear encoder in that space to discard such information. Therefore, even for large values of λ, Kernel-SARL is able to simultaneously achieve high target accuracy while keeping the adversary performance low. (2) Despite being handicapped by the fact that Kernel-SARL is evaluated against stronger adversaries than it is trained against, its performance is comparable to that of the DNN baselines. In fact, it outperforms both ML-ARL and MaxEnt-ARL with respect to the target task. (3) Despite repeated attempts with different hyper-parameters and choices of optimizer, SSGD was highly unstable across most datasets, often got stuck in local optima, and failed to find good solutions.

Figure 10 shows the mean squared error (MSE) of the adversary as we vary the relative trade-off λ between the target and adversary objectives. The plot illustrates, (1) the lower and upper bounds α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1]. This is, however, not the case on the test set, as the bounds are computed through empirical moments as opposed to the true covariance matrices.
Figure 10: CIFAR-100: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set. The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively. (a) Linear-SARL, (b) Kernel-SARL.
Figure 11: CIFAR-100: Optimal embedding dimensionality learned by SARL. At small values of λ, the objective favors the target task, which predicts 20 classes; thus an embedding dimensionality of 19 is optimal for a linear target regressor. At large values of λ, the objective only seeks to hinder the adversary; thus SARL determines the optimal dimensionality of the embedding as one.

Figure 11 plots the optimal embedding dimensionality determined by SARL as a function of the trade-off parameter λ. At small values of λ, the objective favors the target task, i.e., 20-class prediction. Thus, SARL does indeed determine the optimal dimensionality of 19 for a 20-class linear target regressor. However, at large values of λ, the objective only seeks to hinder the sensitive task, i.e., 100-class prediction. In this case, the ideal embedding dimensionality from the perspective of the linear adversary regressor is at least 99. The dimensionality of one ascertained by SARL is thus optimal for maximally mitigating the leakage of the sensitive attribute from the embedding. However, unsurprisingly, the target task also suffers significantly.
7. Concluding Remarks
We studied the "linear" form of adversarial representation learning (ARL), where all the entities are linear functions. We showed that the optimization problem, even for this simplified version, is both non-convex and non-differentiable. Using tools from spectral learning, we obtained a closed-form expression for the global optima and derived analytical bounds on the achievable utility and invariance. We also extended these results to non-linear parameterizations through kernelization. Numerical experiments on multiple datasets indicated that the global optima solution of the "kernel" form of ARL is able to obtain a trade-off between utility and invariance that is comparable to that of local optima solutions of deep neural network based ARL. At the same time, unlike DNN based solutions, the proposed method can (1) analytically determine the achievable utility and invariance bounds, and (2) provide explicit control over the trade-off between utility and invariance.

Admittedly, the results presented in this paper do not extend directly to deep neural network based formulations of ARL. However, we believe this work sheds light on the nature of the ARL optimization problem and aids our understanding of it. It helps delineate the role of the optimization algorithm and the choice of embedding function, highlighting the trade-off between the expressivity of the functions and our ability to obtain the global optima of the adversarial game. We consider our contribution a first step towards controlling the non-convexity that naturally appears in game-theoretic representation learning.
Acknowledgements:
This work was performed under the following financial assistance award 60NANB18D210 from U.S. Department of Commerce, National Institute of Standards and Technology.
References

[1] Martin Bertran, Natalia Martinez, Afroditi Papadaki, Qiang Qiu, Miguel Rodrigues, Galen Reeves, and Guillermo Sapiro. Adversarially learned representations for information obfuscation and inference. In International Conference on Machine Learning, 2019.
[2] Dimitri P. Bertsekas. Nonlinear Programming, 2nd edition. Athena Scientific, Belmont, MA, 1999.
[3] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[6] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[7] Harrison Edwards and Amos Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
[8] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 2015.
[9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[10] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):643–660, 2001.
[11] Gene H. Golub and Victor Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10(2):413–432, 1973.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[13] Hassan K. Khalil. Nonlinear Systems. Prentice Hall, 1996.
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[15] Effrosini Kokiopoulou, Jie Chen, and Yousef Saad. Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications, 18(3):565–602, 2011.
[16] Junpei Komiyama, Akiko Takeda, Junya Honda, and Hajime Shimao. Nonconvex optimization for regression with fairness constraints. In International Conference on Machine Learning, 2018.
[17] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[18] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13(Apr):981–1006, 2012.
[19] Alan J. Laub. Matrix Analysis for Scientists and Engineers, volume 91. SIAM, 2005.
[20] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
[21] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, 2018.
[22] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems, 2017.
[23] Vahid Mirjalili, Sebastian Raschka, Anoop Namboodiri, and Arun Ross. Semi-adversarial networks: Convolutional autoencoders for imparting privacy to face images. In International Conference on Biometrics, 2018.
[24] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, 2018.
[25] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, 2017.
[26] Proteek Roy and Vishnu Naresh Boddeti. Mitigating information leakage in image representations: A maximum entropy approach. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[27] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[28] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[29] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, 2017.
[30] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.
[31] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In AAAI/ACM Conference on AI, Ethics, and Society, 2018.

Appendices

Here we include: (a) Section A.1: Proof of Lemma 1; (b) Section A.2: Proof of the relation between the constrained optimization problem in (8) and its Lagrangian formulation in (9); (c) Section A.3: Proof of Theorem 2; (d) Section A.4: Proof of Theorem 3; (e) Section B: Empirical moments based solution for the linear encoder; (f) Section C: A detailed description of the Kernel-ARL extension, including the derivation of its solution; and (g) Section D: Proof of Lemma 4.
A. Proofs
We recall that for any square matrix M, its trace, denoted by Tr[M], is defined as the sum of all its diagonal elements. The Frobenius norm of M can be obtained as ‖M‖_F² = Tr(M M^T). This allows us to express the MSE of a centered random vector in terms of its covariance matrix:

E[‖y − b_y‖²] = Tr[ E{(y − b_y)(y − b_y)^T} ] = Tr[C_y].

Let A and B be two arbitrary matrices with the same dimensions. Further, assume that the subspace R(A) is orthogonal to R(B). Then, using the orthogonal decomposition (i.e., the Pythagorean theorem), we have ‖A + B‖_F² = ‖A‖_F² + ‖B‖_F². We provide the statements of the lemmas and theorems for the sake of convenience, along with their proofs.
A.1. Proof of Lemma 1
Lemma 5.
Let x and t be two random vectors with E[x] = 0, E[t] = b, and C_x ≻ 0. Consider a linear regressor, t̂ = Wz + b, where W ∈ R^{m×r} is the parameter matrix, and z ∈ R^r is an encoded version of x for a given Θ_E: x ↦ z = Θ_E x, Θ_E ∈ R^{r×d}. The minimum MSE that can be achieved by designing W is given as

min_W E[‖t − t̂‖²] = Tr[C_t] − ‖P_M Q_x^{-T} C_xt‖_F²

where M = Q_x Θ_E^T ∈ R^{d×r}, and Q_x ∈ R^{d×d} is a Cholesky factor of C_x as shown in (1).

Proof. Direct calculation yields:

J_t = E[‖t − t̂‖²]
    = Tr[ E{(t − b − Wz)(t − b − Wz)^T} ]
    = Tr[ E{(t − b)(t − b)^T + (W Θ_E x)(W Θ_E x)^T − (t − b)(W Θ_E x)^T − (W Θ_E x)(t − b)^T} ]
    = Tr[ C_t + (W Θ_E) C_x (W Θ_E)^T − C_tx (W Θ_E)^T − (W Θ_E) C_tx^T ]
    = Tr[ C_t + (W Θ_E Q_x^T)(W Θ_E Q_x^T)^T − C_tx (W Θ_E)^T − (W Θ_E) C_tx^T ]
    = Tr[ (W Θ_E Q_x^T − C_tx Q_x^{-1})(W Θ_E Q_x^T − C_tx Q_x^{-1})^T + C_t − (C_tx Q_x^{-1})(C_tx Q_x^{-1})^T ]
    = ‖Q_x Θ_E^T W^T − Q_x^{-T} C_xt‖_F² − ‖Q_x^{-T} C_xt‖_F² + Tr[C_t]

Hence, the minimizer of J_t is obtained by minimizing the first term in the last equation, which is a standard least squares problem. Let M = Q_x Θ_E^T; then the minimizer is given by

W^T = M^† Q_x^{-T} C_xt.

Using the orthogonal decomposition ‖Q_x^{-T} C_xt‖_F² = ‖P_M Q_x^{-T} C_xt‖_F² + ‖P_{M⊥} Q_x^{-T} C_xt‖_F², we have

‖Q_x Θ_E^T W^T − Q_x^{-T} C_xt‖_F² = ‖M W^T − P_M Q_x^{-T} C_xt‖_F² + ‖P_{M⊥} Q_x^{-T} C_xt‖_F²
                                   = ‖M M^† Q_x^{-T} C_xt − P_M Q_x^{-T} C_xt‖_F² + ‖P_{M⊥} Q_x^{-T} C_xt‖_F²
                                   = ‖P_{M⊥} Q_x^{-T} C_xt‖_F²

where we used M M^† = P_M. Therefore, we obtain the minimum value as

Tr[C_t] − ‖P_M Q_x^{-T} C_xt‖_F².
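As a sanity check on Lemma 1 (our own illustrative script, not part of the paper), one can compare the closed-form expression against a numerically fitted least-squares regressor on synthetic zero-mean data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 6, 3, 2, 200_000
Theta_E = rng.standard_normal((r, d))                # arbitrary fixed encoder
A = rng.standard_normal((m, d))
X = rng.standard_normal((d, n))                      # zero-mean inputs, full-rank covariance
T = A @ X + 0.5 * rng.standard_normal((m, n))        # targets correlated with x
Z = Theta_E @ X

# Empirical moments
C_x  = X @ X.T / n
C_xt = X @ T.T / n
C_t  = T @ T.T / n

# Closed form: Tr[C_t] - ||P_M Q_x^{-T} C_xt||_F^2 with M = Q_x Theta_E^T
Q_x = np.linalg.cholesky(C_x).T
M = Q_x @ Theta_E.T
P_M = M @ np.linalg.pinv(M.T @ M) @ M.T
closed_form = np.trace(C_t) - np.linalg.norm(P_M @ np.linalg.inv(Q_x).T @ C_xt) ** 2

# Direct least squares: min_W (1/n)||T - W Z||_F^2 (data is zero-mean, so b = 0)
W, *_ = np.linalg.lstsq(Z.T, T.T, rcond=None)
direct = np.mean(np.sum((T - W.T @ Z) ** 2, axis=0))

print(closed_form, direct)                            # the two values should agree closely
```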
A.2. Relation Between the Constrained Optimization Problem in (8) and its Lagrangian Formulation in (9)

Consider the optimization problem in (8),
$$G_\alpha = \arg\min_{G} J_y(G), \quad \text{s.t. } J_s(G) \ge \alpha, \tag{17}$$
and the optimization problem in (9),
$$G_\lambda = \arg\min_{G} J_\lambda(G), \tag{18}$$
where $J_\lambda(G) = (1-\lambda) J_y(G) - \lambda J_s(G)$, $\lambda \in [0, 1]$.

Claim.
For each $\lambda \in [0, 1]$, a solution $G_\lambda$ of (18) is also a solution of (17) with
$$\alpha = J_s(G_\lambda). \tag{19}$$

Proof.
Let us consider (17) while assuming that (18) is satisfied. For each $\lambda$ and corresponding solution $G_\lambda$, let $\alpha$ be given as in (19). For an arbitrary $G$ satisfying $J_s(G) \ge \alpha$, we have
$$(1-\lambda)J_y(G_\lambda) - \lambda\alpha = (1-\lambda)J_y(G_\lambda) - \lambda J_s(G_\lambda) \le (1-\lambda)J_y(G) - \lambda J_s(G), \tag{20}$$
where the second step follows from the assumption that (18) is satisfied. Consequently, we have
$$(1-\lambda)\big[J_y(G) - J_y(G_\lambda)\big] \ge \lambda\big[J_s(G) - \alpha\big] \ge 0. \tag{21}$$
Since $J_s(G) \ge \alpha$, the second inequality in (21) holds; this implies that $J_y(G) \ge J_y(G_\lambda)$, and consequently $G_\lambda$ is a minimizer of problem (17).
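As a concrete toy illustration of this claim (the quadratic stand-ins for $J_y$ and $J_s$ and the one-dimensional search space below are illustrative choices, not the paper's objectives), the following sketch solves the Lagrangian problem (18) on a grid for several values of $\lambda$ and verifies that no feasible point achieves a smaller target loss than $G_\lambda$.

```python
import numpy as np

# Toy instantiation: G is a unit vector in R^2, J_y(G) = G^T A G, J_s(G) = G^T B G
A = np.array([[2.0, 0.3], [0.3, 0.5]])
B = np.array([[0.4, 0.1], [0.1, 1.5]])
angles = np.linspace(0.0, np.pi, 2001)
Gs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # candidate unit vectors
J_y = np.einsum("ni,ij,nj->n", Gs, A, Gs)
J_s = np.einsum("ni,ij,nj->n", Gs, B, Gs)

for lam in [0.1, 0.3, 0.5, 0.7, 0.9]:
    J_lam = (1 - lam) * J_y - lam * J_s
    k = np.argmin(J_lam)                  # solution G_lambda of the Lagrangian problem (18)
    alpha = J_s[k]                        # the invariance level it achieves, eq. (19)
    feasible = J_s >= alpha - 1e-9        # all G with J_s(G) >= alpha
    # The claim: no feasible G achieves a smaller target loss than G_lambda
    assert J_y[feasible].min() >= J_y[k] - 1e-9
```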
A.3. Proof of Theorem 2

Theorem 6.
As a function of $G_E \in \mathbb{R}^{d \times r}$, the objective function in equation (9) is neither convex nor differentiable.

Proof. Recall that $P_G$ is equal to $G_E(G_E^TG_E)^\dagger G_E^T$. Therefore, due to the involvement of the pseudo-inverse, (9) is not differentiable (see [11]).

For non-convexity, consider the fact that $f(G_E)$ is convex in $G_E \in \mathbb{R}^{d \times r}$ if and only if $h(t) = f(tG_1 + G_0)$ is convex in $t \in \mathbb{R}$ for all constant matrices $G_1, G_0 \in \mathbb{R}^{d \times r}$ (see [4]). In order to use this result, consider the rank-one matrices
$$G_1 = \begin{bmatrix} 1 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 0 \end{bmatrix} \quad \text{and} \quad G_0 = \begin{bmatrix} 1 & 0 & \cdots & 0\\ 1 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \end{bmatrix},$$
and set $G_E = tG_1 + G_0$. Then
$$P_G(t) = G_E(G_E^TG_E)^\dagger G_E^T = \frac{1}{(t+1)^2 + 1}\begin{bmatrix} (t+1)^2 & (t+1) & 0 & \cdots\\ (t+1) & 1 & 0 & \cdots\\ 0 & 0 & 0 & \cdots\\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}.$$
Using basic properties of the trace together with Lemma 1, we get $(1-\lambda)J_y(G_E) - \lambda J_s(G_E) = \mathrm{Tr}\big[P_G(t)B\big]$, where the matrix $B$ is given in (12). Now, represent $B$ as
$$B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1d}\\ b_{12} & b_{22} & \cdots & b_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ b_{1d} & b_{2d} & \cdots & b_{dd} \end{bmatrix}.$$
Thus,
$$\mathrm{Tr}\big[P_G(t)B\big] = b_{11} + \frac{2b_{12}(t+1) + b_{22} - b_{11}}{(t+1)^2 + 1}.$$
It can be shown that this function of $t$ is convex only if $b_{12} = 0$ and $b_{11} = b_{22}$. On the other hand, if these two conditions hold, it can be similarly shown that $(1-\lambda)J_y(G_E) - \lambda J_s(G_E)$ is non-convex by considering a different pair of matrices $G_1$ and $G_0$. This implies that $(1-\lambda)J_y(G_E) - \lambda J_s(G_E)$ is not convex.
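The non-convexity argument can be visualized numerically: the sketch below (with an illustrative $3\times 3$ symmetric matrix $B$ having $b_{12}\neq 0$) evaluates $h(t) = \mathrm{Tr}[P_G(t)B]$ along the line $G_E = tG_1 + G_0$ used in the proof and checks that its discrete second differences change sign, so $h$ is neither convex nor concave.

```python
import numpy as np

# Illustrative symmetric B with b_12 != 0
B = np.array([[1.0, 0.8, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.0, 0.5]])
d, r = 3, 2
G1 = np.zeros((d, r)); G1[0, 0] = 1.0
G0 = np.zeros((d, r)); G0[0, 0] = 1.0; G0[1, 0] = 1.0

def h(t):
    GE = t * G1 + G0
    PG = GE @ np.linalg.pinv(GE.T @ GE) @ GE.T    # projection onto R(G_E)
    return np.trace(PG @ B)

ts = np.linspace(-4.0, 4.0, 401)
vals = np.array([h(t) for t in ts])
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]   # discrete curvature of h
# A convex function has non-negative second differences everywhere;
# here the sign changes, so h (and hence the objective) is not convex.
print(second_diff.min() < 0 < second_diff.max())       # expected: True
```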
A.4. Proof of Theorem 3

Theorem 7.
Assume that the number of negative eigenvalues of $B$ in (12) is $j$. Denote $\gamma = \min\{r, j\}$. Then the minimum value in (10) is given as
$$\beta_1 + \beta_2 + \cdots + \beta_\gamma,$$
where $\beta_1 \le \beta_2 \le \cdots \le \beta_\gamma < 0$ are the $\gamma$ smallest eigenvalues of $B$. The minimum can be attained by $G_E = V$, where the columns of $V$ are the eigenvectors corresponding to these $\gamma$ negative eigenvalues of $B$.

Proof. Consider the inner optimization problem of (11) in (10). Using the trace optimization problems and their solutions in [15], we get
$$\min_{G_E^TG_E = I_i} J_\lambda(G_E) = \min_{G_E^TG_E = I_i} \mathrm{Tr}\big[G_E^T B G_E\big] = \beta_1 + \beta_2 + \cdots + \beta_i,$$
where $\beta_1, \beta_2, \ldots, \beta_i$ are the $i$ smallest eigenvalues of $B$, and the minimum value is achieved by the matrix $V$ whose columns are the corresponding eigenvectors. If the number of negative eigenvalues of $B$ is less than $r$, then the optimal $i$ in (10) is $j$; otherwise the optimal $i$ is $r$.
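In code, the spectral solution of Theorem 3/7 reduces to a single symmetric eigendecomposition. The sketch below (function and variable names are illustrative) returns the encoder basis and the attained minimum; in Algorithm 1 below, this is the inner step repeated for each value of $\lambda$.

```python
import numpy as np

def spectral_encoder(B, r):
    """Global minimizer of Tr[G^T B G] over orthonormal G with at most r columns:
    keep eigenvectors of the (at most r) most negative eigenvalues of symmetric B."""
    # np.linalg.eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh((B + B.T) / 2)
    j = int(np.sum(eigvals < 0))          # number of negative eigenvalues
    gamma = min(r, j)
    G_E = eigvecs[:, :gamma]              # eigenvectors of the gamma smallest eigenvalues
    return G_E, eigvals[:gamma].sum()     # encoder basis and the attained minimum

# Toy usage with an illustrative symmetric B
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
B = (A + A.T) / 2
G_E, min_val = spectral_encoder(B, r=3)
print(G_E.shape, min_val)
```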
B. Empirical Moments Based Solution to Linear Encoder

In many practical scenarios, we only have access to data samples rather than the true mean vectors and covariance matrices. Therefore, the solution in Section 3.2 might not be feasible in such cases. In this section, we provide an approach for solving the optimization problem in Section 3.2 that relies on empirical moments and is valid even if the covariance matrix $C_x$ is not full rank.

First, for a given $\Theta_E$, we find $J_y = \min_{W_y, b_y}\mathrm{MSE}(\hat{y} - y)$. For a given $W_y$, we first minimize over $b_y$:
$$\min_{b_y} E\big\{\|W_y\Theta_E x + b_y - y\|^2\big\} = \min_{b_y}\frac{1}{n}\sum_{k=1}^{n}\big\|W_y\Theta_E x_k + b_y - y_k\big\|^2 = \frac{1}{n}\sum_{k=1}^{n}\big\|W_y\Theta_E x_k + c - y_k\big\|^2,$$
where we used the empirical expectation in the second step and the minimizer $c$ is
$$c = \frac{1}{n}\sum_{k=1}^{n}\big(y_k - W_y\Theta_E x_k\big) = \frac{1}{n}\sum_{k=1}^{n}y_k - W_y\Theta_E\frac{1}{n}\sum_{k=1}^{n}x_k = E\{y\} - W_y\Theta_E E\{x\}. \tag{22}$$
Let all the columns of the matrix $C$ be equal to $c$. We now have
$$\begin{aligned}
J_y &= \min_{W_y, b_y}\mathrm{MSE}(\hat{y} - y) = \min_{W_y}\frac{1}{n}\big\|W_y\Theta_E X + C - Y\big\|_F^2 = \min_{W_y}\frac{1}{n}\big\|W_y\Theta_E\tilde{X} - \tilde{Y}\big\|_F^2\\
&= \min_{W_y}\frac{1}{n}\big\|\tilde{X}^T\Theta_E^TW_y^T - \tilde{Y}^T\big\|_F^2 = \min_{W_y}\frac{1}{n}\big\|MW_y^T - P_M\tilde{Y}^T\big\|_F^2 + \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2\\
&= \frac{1}{n}\big\|\underbrace{MM^\dagger}_{P_M}\tilde{Y}^T - P_M\tilde{Y}^T\big\|_F^2 + \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2 = \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2 = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|P_M\tilde{Y}^T\big\|_F^2,
\end{aligned}$$
where in the third step we used (22), $M = \tilde{X}^T\Theta_E^T$, and the fifth step is due to the orthogonal decomposition. Using the same approach, we get
$$J_s = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\big\|P_M\tilde{S}^T\big\|_F^2. \tag{23}$$
Now, assume that the columns of $L_x$ form an orthonormal basis for the column space of $\tilde{X}^T$. Therefore, for any $M$, there exists a $G_E$ such that $L_xG_E = M$. In general, there is no bijection between $\Theta_E$ and $G_E$ in the equality $\tilde{X}^T\Theta_E^T = L_xG_E$. However, there is a bijection between $G_E$ and $\Theta_E$ when we restrict ourselves to $\Theta_E$ with $\mathcal{R}(\Theta_E^T) \subseteq \mathcal{N}(\tilde{X}^T)^\perp$. This restricted bijection is sufficient to consider, since for any $\Theta_E$ with $\mathcal{R}(\Theta_E^T) \subseteq \mathcal{N}(\tilde{X}^T)$ we have $M = 0$. Once $G_E$ is determined, $\Theta_E^T$ can be obtained as
$$\Theta_E^T = (\tilde{X}^T)^\dagger L_xG_E + \Theta_0, \qquad \mathcal{R}(\Theta_0) \subseteq \mathcal{N}(\tilde{X}^T).$$
However, since $\|\Theta_E\|_F^2 = \|\Theta_E^T\|_F^2 = \|(\tilde{X}^T)^\dagger L_xG_E\|_F^2 + \|\Theta_0\|_F^2$, choosing $\Theta_0 = 0$ results in the minimum $\|\Theta_E\|_F$, which is favorable in terms of robustness to noise.

By choosing $\Theta_0 = 0$, determining the encoder $\Theta_E$ becomes equivalent to determining $G_E$. Similar to (7), we have $P_M = L_xP_GL_x^T$. If we assume that the rank of $P_G$ is $i$, then $J_\lambda(G_E)$ in (11) can be expressed (up to an additive constant independent of $G_E$ and the factor $1/n$) as
$$J_\lambda(G_E) = \lambda\big\|L_xG_EG_E^TL_x^T\tilde{S}^T\big\|_F^2 - (1-\lambda)\big\|L_xG_EG_E^TL_x^T\tilde{Y}^T\big\|_F^2,$$
where $G_EG_E^T = P_G$ for some column-orthonormal matrix $G_E \in \mathbb{R}^{d\times i}$. This resembles the optimization problem in (10) and therefore has the same solution as Theorem 3 with the modified $B$ given by
$$B = L_x^T\Big(\lambda\,\tilde{S}^T\tilde{S} - (1-\lambda)\,\tilde{Y}^T\tilde{Y}\Big)L_x. \tag{24}$$
Once $G_E$ is determined, $\Theta_E$ can be obtained as $G_E^TL_x^T(\tilde{X})^\dagger$. Algorithm 1 summarizes our entire solution for the constrained optimization problem in (8) through the solution of the Lagrangian version in (9).

Algorithm 1: Spectral Adversarial Representation Learning
Input: data $X$, target labels $Y$, sensitive labels $S$, tolerable leakage $\alpha_{\min} \le \alpha_{\mathrm{tol}} \le \alpha_{\max}$, tolerance $\epsilon$
Output: linear encoder parameters $\Theta_E$
  $L_x \leftarrow$ orthonormal basis of the column space of $\tilde{X}^T$
  Initialize $\lambda = 1/2$, $\lambda_{\min} = 0$ and $\lambda_{\max} = 1$
  do
    Calculate $B$ in (24)
    $G_E \leftarrow$ eigenvectors of the negative eigenvalues of $B$
    $\Theta_E \leftarrow G_E^TL_x^T(\tilde{X})^\dagger$
    Calculate $\alpha$ using (23)
    if $\alpha < (\alpha_{\mathrm{tol}} - \epsilon)$ then $\lambda_{\min} = \lambda$ and $\lambda \leftarrow (\lambda + \lambda_{\max})/2$
    else if $\alpha > (\alpha_{\mathrm{tol}} + \epsilon)$ then $\lambda_{\max} = \lambda$ and $\lambda \leftarrow (\lambda + \lambda_{\min})/2$
    end if
  while $|\alpha - \alpha_{\mathrm{tol}}| \ge \epsilon$
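A compact NumPy sketch of Algorithm 1 is given below. It assumes dense, centered data matrices and full reliance on empirical moments as in this section; the function and variable names are illustrative, and it is a sketch of the procedure above rather than the released Kernel-ARL implementation.

```python
import numpy as np

def spectral_arl(X, Y, S, alpha_tol, eps=1e-3, max_iter=50):
    """Sketch of Algorithm 1 (bisection on lambda around the spectral solution).
    X: d x n data, Y: m_y x n target labels, S: m_s x n sensitive labels."""
    n = X.shape[1]
    Xt = X - X.mean(axis=1, keepdims=True)        # centered data \tilde{X}
    Yt = Y - Y.mean(axis=1, keepdims=True)
    St = S - S.mean(axis=1, keepdims=True)
    # Columns of L_x: orthonormal basis of the column space of \tilde{X}^T
    U, sv, _ = np.linalg.svd(Xt.T, full_matrices=False)
    L_x = U[:, sv > 1e-10 * sv.max()]
    lam, lam_min, lam_max = 0.5, 0.0, 1.0
    Theta_E, alpha = None, None
    for _ in range(max_iter):
        B = L_x.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_x   # eq. (24)
        w, V = np.linalg.eigh(B)
        G_E = V[:, w < 0]                         # eigenvectors of negative eigenvalues
        Theta_E = G_E.T @ L_x.T @ np.linalg.pinv(Xt)
        P_M = L_x @ G_E @ G_E.T @ L_x.T
        alpha = (np.linalg.norm(St.T, "fro") ** 2
                 - np.linalg.norm(P_M @ St.T, "fro") ** 2) / n        # J_s from eq. (23)
        if abs(alpha - alpha_tol) < eps:
            break
        if alpha < alpha_tol - eps:               # too much leakage: increase lambda
            lam_min, lam = lam, (lam + lam_max) / 2
        else:                                     # more invariance than needed: decrease lambda
            lam_max, lam = lam, (lam + lam_min) / 2
    return Theta_E, lam, alpha
```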
C. Non-linear Extension Through Kernelization

Figure 12: Kernelized Adversarial Representation Learning consists of four entities: a kernel mapping $\phi_x(\cdot)$, an encoder $E$ that obtains a compact representation $z$ of the mapped input data $\phi_x(x)$, a predictor $T$ that predicts a desired target attribute $y$, and an adversary $A$ that seeks to extract a sensitive attribute $s$, both from the embedding $z$.

We assume that $x$ is non-linearly mapped to $\phi_x(x)$ as illustrated in Figure 12. From the representer theorem (see [27]), we note that $\Theta_E$ can be expressed as $\Theta_E = \Lambda_E\tilde{\Phi}_x^T$. Consequently, the embedded representation $z$ can be computed as
$$z = \Theta_E\phi_x(x) = \Lambda_E\tilde{\Phi}_x^T\phi_x(x) = \Lambda_E D^T\big[k_x(x_1, x), \cdots, k_x(x_n, x)\big]^T.$$
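For concreteness, the sketch below computes the embedding $z$ of a new sample once $\Lambda_E$ is known. It assumes an RBF kernel and takes $D$ to be the usual centering matrix $I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$; both of these, and all variable names, are assumptions made for illustration rather than choices fixed by the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of columns of A and B."""
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def embed(x_new, X_train, Lambda_E, gamma=1.0):
    """z = Lambda_E D^T [k(x_1, x), ..., k(x_n, x)]^T with D = I - (1/n) 1 1^T (assumed)."""
    n = X_train.shape[1]
    k_vec = rbf_kernel(X_train, x_new[:, None], gamma)   # n x 1 kernel column
    D = np.eye(n) - np.ones((n, n)) / n                   # assumed centering matrix
    return Lambda_E @ D.T @ k_vec

# Illustrative usage with random placeholders
rng = np.random.default_rng(4)
X_train = rng.standard_normal((5, 40))    # d x n training data
Lambda_E = rng.standard_normal((2, 40))   # r x n encoder coefficients (placeholder)
z = embed(rng.standard_normal(5), X_train, Lambda_E)
print(z.shape)                            # (2, 1)
```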
C.1. Learning

First, for a given fixed $\Theta_E$, we find $J_y = \min_{W_y, b_y}\mathrm{MSE}(\hat{y} - y)$. For a given $W_y$, we first minimize over $b_y$:
$$\min_{b_y} E\big\{\|W_y\Theta_E\phi_x(x) + b_y - y\|^2\big\} = \min_{b_y}\frac{1}{n}\sum_{k=1}^{n}\big\|W_y\Theta_E\phi_x(x_k) + b_y - y_k\big\|^2 = \frac{1}{n}\sum_{k=1}^{n}\big\|W_y\Theta_E\phi_x(x_k) + c - y_k\big\|^2,$$
where the minimizer $c$ is
$$c = \frac{1}{n}\sum_{k=1}^{n}\big(y_k - W_y\Theta_E\phi_x(x_k)\big) = \frac{1}{n}\sum_{k=1}^{n}y_k - W_y\Theta_E\frac{1}{n}\sum_{k=1}^{n}\phi_x(x_k) = E\{y\} - W_y\Theta_E E\{\phi_x(x)\}. \tag{25}$$
Let all the columns of $C$ be equal to $c$. Therefore, we now have
$$\begin{aligned}
\min_{W_y, b_y}\mathrm{MSE}(\hat{y} - y) &= \min_{W_y}\frac{1}{n}\big\|W_y\Theta_E\Phi_x + C - Y\big\|_F^2 = \min_{W_y}\frac{1}{n}\big\|W_y\Theta_E\tilde{\Phi}_x - \tilde{Y}\big\|_F^2\\
&= \min_{W_y}\frac{1}{n}\big\|\tilde{\Phi}_x^T\Theta_E^TW_y^T - \tilde{Y}^T\big\|_F^2 = \min_{W_y}\frac{1}{n}\big\|MW_y^T - P_M\tilde{Y}^T\big\|_F^2 + \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2\\
&= \frac{1}{n}\big\|\underbrace{MM^\dagger}_{P_M}\tilde{Y}^T - P_M\tilde{Y}^T\big\|_F^2 + \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2 = \frac{1}{n}\big\|P_{M^\perp}\tilde{Y}^T\big\|_F^2 = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|P_M\tilde{Y}^T\big\|_F^2, \quad (26)
\end{aligned}$$
where the third step is due to (25), $M = \tilde{\Phi}_x^T\Theta_E^T$, and the fifth step is the orthogonal decomposition w.r.t. $M$. Using the same approach, we get
$$J_s = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\big\|P_M\tilde{S}^T\big\|_F^2. \tag{27}$$
Finding the optimal $\Theta_E$ is equivalent to finding the optimal $\Lambda_E$ (since $\Theta_E = \Lambda_E\tilde{\Phi}_x^T$), in which case $M = \tilde{\Phi}_x^T\tilde{\Phi}_x\Lambda_E^T = \tilde{K}_x\Lambda_E^T$. Now, assume that the columns of $L_x$ form an orthonormal basis for the column space of $\tilde{K}_x$. As a result, for any $M$, there exists a $G_E$ such that $L_xG_E = M$. In general, there is no bijection between $\Lambda_E$ and $G_E$ in the equality $\tilde{K}_x\Lambda_E^T = L_xG_E$, but there is a bijection between $G_E$ and $\Lambda_E$ when we restrict ourselves to $\Lambda_E$ with $\mathcal{R}(\Lambda_E^T) \subseteq \mathcal{N}(\tilde{K}_x)^\perp$. This restricted bijection is sufficient, since for any $\Lambda_E$ with $\mathcal{R}(\Lambda_E^T) \subseteq \mathcal{N}(\tilde{K}_x)$ we have $M = 0$. Once $G_E$ is determined, $\Lambda_E^T$ can be obtained as
$$\Lambda_E^T = (\tilde{K}_x)^\dagger L_xG_E + \Lambda_0, \qquad \mathcal{R}(\Lambda_0) \subseteq \mathcal{N}(\tilde{K}_x).$$
However, since $\|\Lambda_E\|_F^2 = \|\Lambda_E^T\|_F^2 = \|(\tilde{K}_x)^\dagger L_xG_E\|_F^2 + \|\Lambda_0\|_F^2$, choosing $\Lambda_0 = 0$ results in the minimum $\|\Lambda_E\|_F$, which is favorable in terms of robustness to noise. Similar to (7), we have $P_M = L_xP_GL_x^T$. If we assume that the rank of $P_G$ is $i$, then $J_\lambda(G_E)$ in (11) can be expressed (up to an additive constant independent of $G_E$ and the factor $1/n$) as
$$J_\lambda(G_E) = \lambda\big\|L_xG_EG_E^TL_x^T\tilde{S}^T\big\|_F^2 - (1-\lambda)\big\|L_xG_EG_E^TL_x^T\tilde{Y}^T\big\|_F^2,$$
where $P_G = G_EG_E^T$ for some column-orthonormal matrix $G_E$. This resembles the optimization problem in (10) and therefore has the same solution as Theorem 3 with the modified $B$ given by
$$B = L_x^T\Big(\lambda\,\tilde{S}^T\tilde{S} - (1-\lambda)\,\tilde{Y}^T\tilde{Y}\Big)L_x. \tag{28}$$
Once $G_E$ is determined, $\Lambda_E$ can be computed as $G_E^TL_x^T(\tilde{K}_x^T)^\dagger$. Algorithm 1 (with $\tilde{X}$ replaced by $\tilde{K}_x^T$ in the corresponding steps) summarizes our entire solution if one wishes to solve the constrained optimization problem in (8) instead of its unconstrained Lagrangian version in (9).
It is worth mentioning that the objective function $J_\lambda(G_E)$ is again neither convex nor differentiable; the proof is exactly the same as that of Theorem 2.
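Putting Section C together, the following sketch performs one Lagrangian solve of Kernel-ARL for a fixed $\lambda$: it forms a centered Gram matrix, builds $L_x$, computes the modified $B$ of (28), and applies the spectral solution. The RBF kernel, the double-centering scheme, and all names are illustrative assumptions, not the released implementation. A new sample can then be embedded with the earlier `embed` sketch using the returned `Lambda_E`.

```python
import numpy as np

def kernel_arl(X, Y, S, lam, gamma=1.0):
    """One Lagrangian solve of Kernel-ARL for a fixed lambda (sketch of Section C).
    X: d x n data, Y: m_y x n target labels, S: m_s x n sensitive labels."""
    n = X.shape[1]
    sq = np.sum(X**2, 0)[:, None] + np.sum(X**2, 0)[None, :] - 2 * X.T @ X
    K = np.exp(-gamma * sq)                        # n x n Gram matrix (assumed RBF kernel)
    D = np.eye(n) - np.ones((n, n)) / n
    K_c = D @ K @ D                                # doubly-centered kernel (assumed centering)
    Yt = Y - Y.mean(axis=1, keepdims=True)
    St = S - S.mean(axis=1, keepdims=True)
    U, sv, _ = np.linalg.svd(K_c)
    L_x = U[:, sv > 1e-10 * sv.max()]              # orthonormal basis of R(\tilde{K}_x)
    B = L_x.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_x        # eq. (28)
    w, V = np.linalg.eigh(B)
    G_E = V[:, w < 0]                              # spectral solution (Theorem 3)
    Lambda_E = G_E.T @ L_x.T @ np.linalg.pinv(K_c.T)   # Lambda_E = G_E^T L_x^T (K^T)^dagger
    return Lambda_E, G_E, L_x

# Illustrative usage on random placeholder data
rng = np.random.default_rng(6)
X = rng.standard_normal((4, 60))
Y = rng.standard_normal((1, 60))
S = rng.standard_normal((1, 60))
Lambda_E, G_E, L_x = kernel_arl(X, Y, S, lam=0.5)
print(Lambda_E.shape)
```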
D. Proof of Lemma 4

Lemma 8.
Let the columns of $L_x$ be an orthonormal basis for the column space of $\tilde{K}_x$ (in the linear case, $\tilde{K}_x = \tilde{X}^T\tilde{X}$). Further, assume that the columns of $V_s$ are the right singular vectors corresponding to the zero singular values of $\tilde{S}L_x$ and the columns of $V_y$ are the right singular vectors corresponding to the non-zero singular values of $\tilde{Y}L_x$. Then, we have
$$\begin{aligned}
\gamma_{\min} &= \min_{\Theta_E} J_y(\Theta_E) = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{Y}L_x\big\|_F^2,\\
\gamma_{\max} &= \min_{\arg\max J_s(\Theta_E)} J_y(\Theta_E) = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{Y}L_xV_s\big\|_F^2,\\
\alpha_{\min} &= \max_{\arg\min J_y(\Theta_E)} J_s(\Theta_E) = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{S}L_xV_y\big\|_F^2,\\
\alpha_{\max} &= \max_{\Theta_E} J_s(\Theta_E) = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2.
\end{aligned}$$

Proof.
Firstly, we recall from Section C that instead of $\Lambda_E$ we consider $G_E$. These two matrices are related to each other by $\tilde{K}_x\Lambda_E^T = L_xG_E = M$, where the columns of $L_x$ are an orthonormal basis for the column space of $\tilde{K}_x$. Therefore, we can express the projection onto $M$ in terms of the projection onto $G$, i.e., $P_M = L_xP_GL_x^T$. Using (26), we get
$$\begin{aligned}
\gamma_{\min} &= \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\max_{\Theta_E}\big\|P_M\tilde{Y}^T\big\|_F^2 = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\max_{G_E}\big\|L_xP_{G_E}L_x^T\tilde{Y}^T\big\|_F^2\\
&= \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\max_i\Big\{\max_{G_E^TG_E = I_i}\mathrm{Tr}\big[G_E^TL_x^T\tilde{Y}^T\tilde{Y}L_xG_E\big]\Big\}\\
&= \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\mathrm{Tr}\big[V_y^TL_x^T\tilde{Y}^T\tilde{Y}L_xV_y\big] = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\sum_{\sigma_k > 0}\sigma_k^2 = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{Y}L_x\big\|_F^2, \tag{29}
\end{aligned}$$
where the fourth step is borrowed from the trace optimization problems studied in [15] and the $\sigma_k$'s are the singular values of $\tilde{Y}L_x$.

In order to better interpret the bounds, we consider the one-dimensional case where $x, y \in \mathbb{R}$. In this setting, the correlation coefficient (denoted by $\rho(\cdot,\cdot)$) between $x$ and $y$ is
$$\rho(x, y) = \frac{\tilde{Y}\tilde{X}^T}{\sqrt{\tilde{Y}\tilde{Y}^T\,\tilde{X}\tilde{X}^T}} = \frac{\|\tilde{Y}L_x\|_F}{\sqrt{n}\,\sigma_y} = \sqrt{1 - \frac{\gamma_{\min}}{\sigma_y^2}}, \tag{30}$$
where $\sigma_y^2 = \|\tilde{Y}\|_F^2/n$. As a result, the normalized MSE can be expressed as
$$\frac{\gamma_{\min}}{\sigma_y^2} = 1 - \rho^2(x, y). \tag{31}$$
Therefore, the lower bound on the target's MSE is independent of the encoder and is instead related only to the alignment between the subspaces spanned by the data and the labels.

Next, we find an encoder which allows the target task to obtain its optimal loss, $\gamma_{\min}$, while seeking to minimize the leakage of sensitive attributes as much as possible. Thus, we constrain the domain of the encoder to $\{\arg\min J_y(\Theta_E)\}$. Consider an encoder $G_E$ whose columns are the concatenation of the columns of $V_y$ with at least one singular vector corresponding to a zero singular value of $\tilde{Y}L_x$. Then $V_y \subseteq G_E$ and consequently $\|L_xP_{V_y}L_x^TU\|_F \le \|L_xP_GL_x^TU\|_F$ for an arbitrary matrix $U$. As a result, $J_s(G_E) \le J_s(V_y)$ while, at the same time, $J_y(G_E) = J_y(V_y)$. The latter can be observed from
$$\big\|L_xP_{G_E}L_x^T\tilde{Y}^T\big\|_F = \big\|\tilde{Y}L_xG_EG_E^TL_x^T\big\|_F = \big\|\tilde{Y}L_xV_yV_y^TL_x^T\big\|_F = \big\|L_xP_{V_y}L_x^T\tilde{Y}^T\big\|_F, \tag{32}$$
where the middle equality holds because $\tilde{Y}L_x$ annihilates the additional columns of $G_E$ (they are singular vectors with zero singular value). Hence, within $\{\arg\min J_y(\Theta_E)\}$, the adversary loss $J_s$ is maximized by $G_E = V_y$. We then have
$$\alpha_{\min} = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\big\|L_xP_{V_y}L_x^T\tilde{S}^T\big\|_F^2 = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\mathrm{Tr}\big[V_y^TL_x^T\tilde{S}^T\tilde{S}L_xV_y\big] = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{S}L_xV_y\big\|_F^2. \tag{33}$$
This bound can again be interpreted under the one-dimensional setting of $x, s \in \mathbb{R}$ as
$$\frac{\alpha_{\min}}{\sigma_s^2} = 1 - \rho^2(x, s). \tag{34}$$
On the other hand, $\alpha_{\max}$ turns out to be
$$\alpha_{\max} = \frac{1}{n}\big\|\tilde{S}^T\big\|_F^2 = \sigma_s^2, \tag{35}$$
which can be achieved via the trivial choice of $G_E = 0$.
However, we let the columns of $G_E$ be the singular vectors corresponding to all zero singular values of $\tilde{S}L_x$ in order to maximize $\|P_M\tilde{Y}^T\|_F^2$ while at the same time ensuring that $J_s(G_E)$ equals $\alpha_{\max}$. As a result, we have
$$\gamma_{\max} = \frac{1}{n}\big\|\tilde{Y}^T\big\|_F^2 - \frac{1}{n}\big\|\tilde{Y}L_xV_s\big\|_F^2.$$
For the one-dimensional case, i.e., $x, y, s \in \mathbb{R}$, we get $V_s = 0$ and consequently $\gamma_{\max} = \sigma_y^2$.
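Finally, the four bounds of Lemma 4/8 can be computed directly from data with one SVD each. The sketch below does so for the linear case ($L_x$ built from $\tilde{X}^T$) on illustrative random data; all function and variable names are placeholders.

```python
import numpy as np

def fro2(A):
    return np.linalg.norm(A, "fro") ** 2

def utility_invariance_bounds(X, Y, S, tol=1e-10):
    """Compute (gamma_min, gamma_max, alpha_min, alpha_max) of Lemma 8, linear case."""
    n = X.shape[1]
    Xt = X - X.mean(1, keepdims=True)
    Yt = Y - Y.mean(1, keepdims=True)
    St = S - S.mean(1, keepdims=True)
    # Orthonormal basis of the column space of \tilde{X}^T
    U, sv, _ = np.linalg.svd(Xt.T, full_matrices=False)
    L_x = U[:, sv > tol * sv.max()]
    # V_y: right singular vectors of \tilde{Y} L_x with non-zero singular values
    _, sy, Vyt = np.linalg.svd(Yt @ L_x, full_matrices=True)
    V_y = Vyt.T[:, :int(np.sum(sy > tol))]
    # V_s: right singular vectors of \tilde{S} L_x with (numerically) zero singular values
    _, ss, Vst = np.linalg.svd(St @ L_x, full_matrices=True)
    V_s = Vst.T[:, int(np.sum(ss > tol)):]
    gamma_min = (fro2(Yt) - fro2(Yt @ L_x)) / n
    gamma_max = (fro2(Yt) - fro2(Yt @ L_x @ V_s)) / n
    alpha_min = (fro2(St) - fro2(St @ L_x @ V_y)) / n
    alpha_max = fro2(St) / n
    return gamma_min, gamma_max, alpha_min, alpha_max

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 300))
Y = rng.standard_normal((2, 300)) + 0.5 * X[:2]      # target correlated with data
S = rng.standard_normal((1, 300)) + 0.5 * X[2:3]     # sensitive attribute correlated with data
print(utility_invariance_bounds(X, Y, S))
```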